AMES IOWA HOUSE PRICE PREDICTION¶

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Acknowledge The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.

In [1]:
# This is Ames Housing Project. The goal is to predict residential house price in Ames, Iowa.
# We have 79 explanatory variables to use as our predictor
# Load all necessary modules
import pandas as pd # pandas module
import numpy as np # numpy module
import matplotlib.pyplot as plt # matplotlib module
import seaborn as sns # seaborn module
In [2]:
# load the data 
ameHouse = pd.read_csv("train.csv")
ameHouse_test = pd.read_csv("test.csv")
In [3]:
ameHouse.head()
Out[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [4]:
ameHouse_test.head()
Out[4]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

In [5]:
# check for na value
null_ames = [col for col in ameHouse.columns if ameHouse[col].isnull().sum() > 0]
ameHouse[null_ames].isnull().sum().sort_values(ascending=False)/len(ameHouse)
Out[5]:
PoolQC          0.995205
MiscFeature     0.963014
Alley           0.937671
Fence           0.807534
FireplaceQu     0.472603
LotFrontage     0.177397
GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageQual      0.055479
GarageCond      0.055479
BsmtExposure    0.026027
BsmtFinType2    0.026027
BsmtFinType1    0.025342
BsmtCond        0.025342
BsmtQual        0.025342
MasVnrArea      0.005479
MasVnrType      0.005479
Electrical      0.000685
dtype: float64
In [6]:
# There are 4 features with missing value ratio more than 50%. 
# PoolQC, MiscFeature, Alley, and Fence
# We need to decide what need to be done upon all of these features.
# PoolQC --> Pool Quality
# MiscFeature ---> Contains feature no covered in other features
# Alley ---> Type of alley access
# Fence ---> Fence quality
# FireplaceQu ---> Fireplace Quality

These 5 high rate missing ratio features, are very rare quality to has among common residential house, so no wonder if they have high numbers of missing value. Because of this reasoning, we can just drop this features

In [7]:
# Drop features with missing value more than 50%
ameHouse = ameHouse.drop(['PoolQC','MiscFeature','Alley','Fence','FireplaceQu'],axis=1)
ameHouse.shape # the ncol now become 76
Out[7]:
(1460, 76)
In [8]:
# Don't forget to drop the same features from test dataset
ameHouse_test = ameHouse_test.drop(['PoolQC','MiscFeature','Alley','Fence','FireplaceQu'], axis=1)
ameHouse_test.shape
Out[8]:
(1459, 75)

Why do we don't drop 'LotFrontage'? The reason is because the missing ratio is relatively tolerable, and based on the documentation, there is chance that this feature usable for the model.

'LotFrontage' definition: Linear feet of street connected to property

In [9]:
# Data summary for numerical features
ameHouse.describe()
Out[9]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 ... 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 730.500000 56.897260 70.049958 10516.828082 6.099315 5.575342 1971.267808 1984.865753 103.685262 443.639726 ... 94.244521 46.660274 21.954110 3.409589 15.060959 2.758904 43.489041 6.321918 2007.815753 180921.195890
std 421.610009 42.300571 24.284752 9981.264932 1.382997 1.112799 30.202904 20.645407 181.066207 456.098091 ... 125.338794 66.256028 61.119149 29.317331 55.757415 40.177307 496.123024 2.703626 1.328095 79442.502883
min 1.000000 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 365.750000 20.000000 59.000000 7553.500000 5.000000 5.000000 1954.000000 1967.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2007.000000 129975.000000
50% 730.500000 50.000000 69.000000 9478.500000 6.000000 5.000000 1973.000000 1994.000000 0.000000 383.500000 ... 0.000000 25.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 163000.000000
75% 1095.250000 70.000000 80.000000 11601.500000 7.000000 6.000000 2000.000000 2004.000000 166.000000 712.250000 ... 168.000000 68.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 214000.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 ... 857.000000 547.000000 552.000000 508.000000 480.000000 738.000000 15500.000000 12.000000 2010.000000 755000.000000

8 rows × 38 columns

In [10]:
# Exploring data distribuiton
ame_05_quan = ameHouse.quantile(0.05)
ame_25_quan = ameHouse.quantile(0.25)
ame_50_quan = ameHouse.quantile(0.50) # same as median
ame_75_quan = ameHouse.quantile(0.75)
ame_95_quan = ameHouse.quantile(0.95)
quantile_ame = pd.DataFrame({'ame_05':ame_05_quan,
                            'ame_25':ame_25_quan,
                            'ame_50':ame_50_quan,
                            'ame_75':ame_75_quan,
                            'ame_95':ame_95_quan})
quantile_ame
Out[10]:
ame_05 ame_25 ame_50 ame_75 ame_95
Id 73.95 365.75 730.5 1095.25 1387.05
MSSubClass 20.00 20.00 50.0 70.00 160.00
LotFrontage 34.00 59.00 69.0 80.00 107.00
LotArea 3311.70 7553.50 9478.5 11601.50 17401.15
OverallQual 4.00 5.00 6.0 7.00 8.00
OverallCond 4.00 5.00 5.0 6.00 8.00
YearBuilt 1916.00 1954.00 1973.0 2000.00 2007.00
YearRemodAdd 1950.00 1967.00 1994.0 2004.00 2007.00
MasVnrArea 0.00 0.00 0.0 166.00 456.00
BsmtFinSF1 0.00 0.00 383.5 712.25 1274.00
BsmtFinSF2 0.00 0.00 0.0 0.00 396.20
BsmtUnfSF 0.00 223.00 477.5 808.00 1468.00
TotalBsmtSF 519.30 795.75 991.5 1298.25 1753.00
1stFlrSF 672.95 882.00 1087.0 1391.25 1831.25
2ndFlrSF 0.00 0.00 0.0 728.00 1141.05
LowQualFinSF 0.00 0.00 0.0 0.00 0.00
GrLivArea 848.00 1129.50 1464.0 1776.75 2466.10
BsmtFullBath 0.00 0.00 0.0 1.00 1.00
BsmtHalfBath 0.00 0.00 0.0 0.00 1.00
FullBath 1.00 1.00 2.0 2.00 2.00
HalfBath 0.00 0.00 0.0 1.00 1.00
BedroomAbvGr 2.00 2.00 3.0 3.00 4.00
KitchenAbvGr 1.00 1.00 1.0 1.00 1.00
TotRmsAbvGrd 4.00 5.00 6.0 7.00 10.00
Fireplaces 0.00 0.00 1.0 1.00 2.00
GarageYrBlt 1930.00 1961.00 1980.0 2002.00 2007.00
GarageCars 0.00 1.00 2.0 2.00 3.00
GarageArea 0.00 334.50 480.0 576.00 850.10
WoodDeckSF 0.00 0.00 0.0 168.00 335.00
OpenPorchSF 0.00 0.00 25.0 68.00 175.05
EnclosedPorch 0.00 0.00 0.0 0.00 180.15
3SsnPorch 0.00 0.00 0.0 0.00 0.00
ScreenPorch 0.00 0.00 0.0 0.00 160.00
PoolArea 0.00 0.00 0.0 0.00 0.00
MiscVal 0.00 0.00 0.0 0.00 0.00
MoSold 2.00 5.00 6.0 8.00 11.00
YrSold 2006.00 2007.00 2008.0 2009.00 2010.00
SalePrice 88000.00 129975.00 163000.0 214000.00 326100.00
In [11]:
# Let's try to visualize SalePrice distribution (Because it is our target value, there are must be something we can learn)
fig, axSales = plt.subplots(1,2,figsize=(10,4))
sns.histplot(ax=axSales[0],x='SalePrice',data=ameHouse).set_title("Histogram of SalePrice")
sns.boxplot(ax=axSales[1],y = 'SalePrice', data=ameHouse, color='red').set_title("Boxplot of SalePrice")
plt.suptitle("Distribution of SalePrice", fontweight = 'bold')
plt.subplots_adjust(wspace=0.4)
plt.show()
In [12]:
# The plot  become clear indication of outliers existence in SalePrice
# We can make it more form by doing normality test using qq plot
import statsmodels.api as sm

sm.qqplot(ameHouse['SalePrice'],line='45')
plt.show()

'SalePrice' data cleary do not follow 45 degree line, which is a strong indication that they don't follow normal distribution.

In [13]:
# All numerical features
ames_numeric=ameHouse.columns[(ameHouse.dtypes=='int64') | (ameHouse.dtypes=='float64')]

Understanding The Features¶

To get better sense about ames housing dataset, i decide to analyze all feature individually.

Id¶

There are no depper meaning of it. Use it as index or identity of each observations.

MSSubClass¶

Indentifies the type of dwelling involved in the sale; basically, it tells us about the residential house type The value is consist of discrete number, range from 20 up to 190. Here some definition of each number :

  • 20==>1-STORY 1946 & NEWER ALL STYLES
  • 30==>1-STORY 1945 & OLDER
  • 40==>1-STORY W/FINISHED ATTIC ALL AGES
  • 45==>1-1/2 STORY - UNFINISHED ALL AGES
  • 50==>1-1/2 STORY FINISHED ALL AGES
  • 60==>2-STORY 1946 & NEWER
  • 70==>2-STORY 1945 & OLDER

etc...

This feature is something that i called 'half' numerical feature, because it doesn't really represent any measurement, instead a set of quantified quality.

In [14]:
# Distribution of MSSubClass
fig, axMSClass = plt.subplots(1,2, figsize=(10,4))
sns.histplot(ax=axMSClass[0],x='MSSubClass',data=ameHouse, bins=16).set_title('MSSubClass in Hist')
sns.countplot(ax=axMSClass[1],x='MSSubClass', data=ameHouse).set_title('MSSubClass in Barplot')
plt.show()

The two plots above show that the MSSubClass feature is more worthy of being treated as a categorical feature

In [15]:
# let's drop MSSubClass from ames_numeric
#ames_numeric = ames_numeric.drop('MSSubClass')
ames_numeric
Out[15]:
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

LotFrontage¶

I've discussed this feature a bit before. Here I rewrite the definition based on the data documentation Linear feet of street connected to property. Based on gimme-shelter, Frontage is "the width of the lot, measured at front part of the lot". When it comes to real estate, bigger frontage means more land, and more land means capacity to bigger house. So,logically speaking, this feature is necessary to predict house selling price.

In [16]:
# For future simplicity, i create a function called numDEA() to return all statistic summary of the feature
def numDEA(col,  df):
    stat_summary = pd.DataFrame({
        'Mean' : round(df[col].mean(),2),
        'Median' : round(df[col].median(),2),
        'Mode': df[col].mode()[0],
        'std' : round(df[col].std(),2),
        'Min': df[col].min(),
        'Max': df[col].max(),
        'Range': df[col].max() - df[col].min(),
        '5%': df[col].quantile(0.05),
        '25%': df[col].quantile(0.25),
        '50%': df[col].quantile(0.50),
        '75%': df[col].quantile(0.75),
        '90%': df[col].quantile(0.9),
        'IQR' : df[col].quantile(0.75) - df.quantile(0.25),
        'Count' : df[col].count(),
        'Unique': df[col].nunique(),
        'Missing Value' : df[col].isnull().sum()  
    }, index = [col])
    return stat_summary
In [17]:
# LotFrontage's Statistical Summary

numDEA('LotFrontage',df= ameHouse)
Out[17]:
Mean Median Mode std Min Max Range 5% 25% 50% 75% 90% IQR Count Unique Missing Value
LotFrontage 70.05 69.0 60.0 24.28 21.0 313.0 292.0 34.0 59.0 69.0 80.0 96.0 21.0 1201 110 259

From central tendency, we can deduce that LotFrontage feature is right-skewed data, which indicate that most of the residential house in Ames, Iowa has frontage around 69 feet (21,03 meter), more or less. There are huge difference between 90% of house frontage and maximum frontage (313 - 96 = 217), which show us the existence of small group of "elite" house with wide frontage.

In [18]:
# Distribution of LotFrontage
sns.histplot(x='LotFrontage',data=ameHouse).set_title('LotFrontage in Hist')
plt.show()
In [19]:
sns.relplot(x='LotFrontage',y='SalePrice',data=ameHouse)
plt.title("SalePrice and LotFrontage")
plt.show()

The scatterplot above gives us signs of a positive correlation between LotFrontage and SalePrice, albeit a bit of an anomaly. This anomaly give some indication of other factors influences SalePrice that i will investigate further.

LotArea¶

From data documentation, 'LotArea' is Lot size in square feet.

In [20]:
# Statistical Summary of LotArea
numDEA('LotArea',df= ameHouse)
Out[20]:
Mean Median Mode std Min Max Range 5% 25% 50% 75% 90% IQR Count Unique Missing Value
LotArea 10516.83 9478.5 7200 9981.26 1300 215245 213945 3311.7 7553.5 9478.5 11601.5 14381.7 4048.0 1460 1073 0
In [21]:
# LotArea Visualization
fig_LotArea, axLotArea = plt.subplots(1,2, figsize=(15,5))
sns.histplot(ax=axLotArea[0], x= 'LotArea',data=ameHouse).set_title("LotArea Distribution")
sns.scatterplot(ax=axLotArea[1],x='LotArea',y='SalePrice',data=ameHouse).set_title("SalePrice vs LotAreab")
plt.suptitle("LotArea Distribution and Relationship", fontweight="bold")
plt.show()

Same as LotFrontage, LotArea is Right-Skewed. Majority of house in Ames, Iowa is small to medium house, while there are small fraction of house that has lotArea from 14.000 feet square (4.000 m2) up to 215.000 feet square (65.532 m2). Because of the scale, scatterplot can not depict the relationship of LotArea and SalePrice really well. To fix it, i transform these two features with natural log. After transformation, the pattern start to show.

In [22]:
viz_only_lotArea = pd.DataFrame({
    'nl_SalePrice':np.log(ameHouse['SalePrice']),
    'nl_LotArea':np.log(ameHouse['LotArea'])
})

sns.scatterplot(x='nl_LotArea',y='nl_SalePrice',data=viz_only_lotArea).set_title("Natural Log SalePrice vs Natural Log LotArea")
plt.show()

OverallQual¶

Based on data documentation, 'OverallQual' is Rates the overall material and finish of the house. The values are representation of a quality measure, such as following: 10 --- > Very Excellent 9 ---> Excellent 8 ---> Very Good 7 ---> Good 6 ---> Above Average 5 ---> Average 4 ---> Below Average 3 ---> Fair 2 ---> Poor 1 ---> Very Poor

Because it has fixed set of values, we should treat 'OverallQual' as categorical variable, specifically, in Ordinal level measurement.

In [23]:
# Drop OverallQual from ames_numeric
ames_numeric =ames_numeric.drop('OverallQual')
ames_numeric
Out[23]:
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')
In [24]:
# OverallQual Exploration
# Frequency

countplot_overall = pd.DataFrame({
    'Count': ameHouse['OverallQual'].value_counts(),
    'Percent(%)':round((ameHouse['OverallQual'].value_counts()/len(ameHouse['OverallQual']))*100,2)
})
print("The mode of this feature is : {}".format(ameHouse['OverallQual'].mode()[0]))
print(countplot_overall)
The mode of this feature is : 5
    Count  Percent(%)
5     397       27.19
6     374       25.62
7     319       21.85
8     168       11.51
4     116        7.95
9      43        2.95
3      20        1.37
10     18        1.23
2       3        0.21
1       2        0.14
In [25]:
sns.countplot(x='OverallQual', data=ameHouse).set_title('OverallQual')
plt.show()

Half of residential housing in Ames, Iowa (52,81%) has average to above average quality, with only 9,67% from total with below average to very poor quality housing. Based on this, i think it safe to deduce that Ames, Iowa is a good environment to live.

In [26]:
# SalePrice and OverallQual
sns.boxplot(x= 'OverallQual',y='SalePrice', data=ameHouse)
plt.title("OverallQual vs SalePrice")
plt.show()

OverallQual able to divide SalePrice distribution rather clearly. It shows how each quality has distintive range of SalePrice, and it indicate that OverallQual is a good feature to predict SalePrice.

OverallCond¶

From data documentation, OverallCond is Rates the overall condition of the house. More or less, it is similiar to OverallQual. The set of values are :

  • 10 Very Excellent
  • 9 Excellent
  • 8 Very Good
  • 7 Good
  • 6 Above Average
  • 5 Average
  • 4 Below Average
  • 3 Fair
  • 2 Poor
  • 1 Very Poor
In [27]:
# let's drop it from  ames_numeric
ames_numeric = ames_numeric.drop('OverallCond')
ames_numeric
Out[27]:
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')
In [28]:
# OverallCond Data Exploration
allCond_count = pd.DataFrame({
    'Count': ameHouse['OverallCond'].value_counts(),
    'Percent(%)':round((ameHouse['OverallCond'].value_counts()/len(ameHouse['OverallCond'])*100),2)
})
print("The mode of this feature is : {}".format(ameHouse['OverallCond'].mode()[0]))
print(allCond_count)
The mode of this feature is : 5
   Count  Percent(%)
5    821       56.23
6    252       17.26
7    205       14.04
8     72        4.93
4     57        3.90
3     25        1.71
9     22        1.51
2      5        0.34
1      1        0.07
In [29]:
# OverallCond Visualization
figCond, axCond = plt.subplots(1,2, figsize=(15,5))
sns.countplot(ax=axCond[0],x='OverallCond',data=ameHouse).set_title("Frequency of OverallCond")
sns.boxplot(ax=axCond[1],x='OverallCond',y='SalePrice',data=ameHouse).set_title("OverallCond vs SalePrice")
plt.show()

Different from OverallQual that able to divide SalePrice relatively well, OverallCond seems to have various relationship pattern to SalePrice. It shows by the boxplot, that house with "average" condition has wide range of SalePrice. Moreover, there are a anomaly where "poor" condition house has higher price than house with better condition. This encourage an assumption that maybe these pattern caused by weak correlation between features, or there are feature with stronger effect involved.

YearBuilt and YearRemodAdd¶

Based on data documentation, YearBuilt is Original construction date YearRemodAdd is Remodel date (same as construction date if no remodeling or additions) The values only specified in year, so nothing much we can do except to ensure there are not any null value, or miss entry

In [30]:
print('''
Missing value : 
{}'''.format(ameHouse[['YearBuilt','YearRemodAdd']].isnull().sum()))
Missing value : 
YearBuilt       0
YearRemodAdd    0
dtype: int64

MasVnrArea¶

Based on Data Documentation, MasVnrArea is Masonry veneer area in square feet Masonry veneer itself is single non-structural external layer of masonry, typically made of brick, stone or manufactured stone(source:Wikipedia)

In [31]:
numDEA('MasVnrArea',df= ameHouse)
Out[31]:
Mean Median Mode std Min Max Range 5% 25% 50% 75% 90% IQR Count Unique Missing Value
MasVnrArea 103.69 0.0 0.0 181.07 0.0 1600.0 1600.0 0.0 0.0 0.0 166.0 335.0 166.0 1452 327 8
In [32]:
# MasVnArea Visualization
figMasVn, axMasVn = plt.subplots(1,2, figsize=(15,5))
sns.histplot(ax=axMasVn[0], x= 'MasVnrArea',data=ameHouse).set_title("MasVnrArea Distribution")
sns.scatterplot(ax=axMasVn[1],x='MasVnrArea',y='SalePrice',data=ameHouse).set_title("SalePrice vs MasVnrArea")
plt.suptitle("MasVnrArea Distribution and Relationship", fontweight="bold")
plt.show()

This data is a litle tricky. If we look upon it, almost all of house in Ames, Iowa has zero feet square of Masonry Venree. It is unclear whether they simply not use Masonry venree, or it is due to error in data entry

BsmtFinSF1 and BsmtFinSF2¶

Based on data documentation, BsmtFinSF1 is Type 1 finished square feet. It is area of basement in certain type. BsmtFinSF2 is Type 2 finished square feet. These two features give us the same measurement, and the second only exist if a house has multiple type of basement.

In [33]:
print(numDEA('BsmtFinSF1',df=ameHouse))
print(numDEA('BsmtFinSF2',df=ameHouse))
              Mean  Median  Mode    std  Min   Max  Range   5%  25%    50%  \
BsmtFinSF1  443.64   383.5     0  456.1    0  5644   5644  0.0  0.0  383.5   

               75%     90%     IQR  Count  Unique  Missing Value  
BsmtFinSF1  712.25  1065.5  712.25   1460     637              0  
             Mean  Median  Mode     std  Min   Max  Range   5%  25%  50%  75%  \
BsmtFinSF2  46.55     0.0     0  161.32    0  1474   1474  0.0  0.0  0.0  0.0   

              90%  IQR  Count  Unique  Missing Value  
BsmtFinSF2  117.2  0.0   1460     144              0  
In [34]:
# BsmtFinSF1 Visualization
figBsmt1, axBsmt1 = plt.subplots(2,2, figsize=(15,10))
sns.histplot(ax=axBsmt1[0,0], x= 'BsmtFinSF1',data=ameHouse).set_title("BsmtFinSF1 Distribution")
sns.scatterplot(ax=axBsmt1[0,1],x='BsmtFinSF1',y='SalePrice',data=ameHouse).set_title("SalePrice vs BsmtFinSF1")
sns.histplot(ax=axBsmt1[1,0], x= 'BsmtFinSF2',data=ameHouse, color="red").set_title("BsmtFinSF2 Distribution")
sns.scatterplot(ax=axBsmt1[1,1],x='BsmtFinSF2',y='SalePrice',data=ameHouse,color="red").set_title("SalePrice vs BsmtFinSF2")
plt.suptitle("BsmtFinSF1 Distribution and Relationship", fontweight="bold")
plt.subplots_adjust(wspace=0.4)
plt.show()

If we ignore observation with zero Basement Area, there is a distinct pattern which indicate positive relationship between Basement Area type 1.But, in BsmtFinSF2, there are no clear relationship with SalePrice. BsmtFinSF2 also has smaller value range with more than 90% of its value is 0.

BsmtUnfSF¶

Unfinished square feet of basement area

In [35]:
numDEA('BsmtUnfSF',ameHouse)
Out[35]:
Mean Median Mode std Min Max Range 5% 25% 50% 75% 90% IQR Count Unique Missing Value
BsmtUnfSF 567.24 477.5 0 441.87 0 2336 2336 0.0 223.0 477.5 808.0 1232.0 585.0 1460 780 0
In [36]:
figBsmtUnf, axBsmtUnf = plt.subplots(1,2, figsize=(15,5))
sns.histplot(ax=axBsmtUnf[0], x= 'BsmtUnfSF',data=ameHouse).set_title("BsmtUnfSF Distribution")
sns.scatterplot(ax=axBsmtUnf[1],x='BsmtUnfSF',y='SalePrice',data=ameHouse).set_title("SalePrice vs BsmtUnfSF")
Out[36]:
Text(0.5, 1.0, 'SalePrice vs BsmtUnfSF')

There are no clear pattern bertween BsmtUnfSF and SalePrice

In [37]:
# There are quite a lot of variables here. So, let's just speed things up
# Within all ames_numeric,  we identify which one has smaller cardinality, and check it up to documentation, and decide whether 
# they are numeric or ordinal/categorical variable
# low cardinality, abritrary, under or equal 15
ames_numeric_low_car = [col for col in ames_numeric if ameHouse[col].nunique() <= 15]
ameHouse[ames_numeric_low_car].nunique()
Out[37]:
MSSubClass      15
BsmtFullBath     4
BsmtHalfBath     3
FullBath         4
HalfBath         3
BedroomAbvGr     8
KitchenAbvGr     4
TotRmsAbvGrd    12
Fireplaces       4
GarageCars       5
PoolArea         8
MoSold          12
YrSold           5
dtype: int64
In [38]:
figMisc, axMisc = plt.subplots(2,3, figsize=(15,10))
sns.boxplot(ax=axMisc[0,0],x='MSSubClass',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axMisc[0,1],x='BsmtFullBath',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axMisc[0,2],x='BsmtHalfBath',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axMisc[1,0],x='FullBath',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axMisc[1,1],x='HalfBath',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axMisc[1,2],x='Fireplaces',y="SalePrice",data=ameHouse)
plt.subplots_adjust(wspace=0.4)
plt.show()

Based on the feature definition, all of this low cardinality feature (except for MSSubClass) are numerical discrete value. They represents a number of facility exists in a house. But, because of their value nature, the proper treatment is to group them together with categorical features.

In [39]:
# Separate high cardinality numeric feature from low cardinality numeric feature
ames_numeric_high_car = [col for col in ames_numeric if col not in ames_numeric_low_car]
ames_numeric_high_car
Out[39]:
['Id',
 'LotFrontage',
 'LotArea',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'GarageYrBlt',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'MiscVal',
 'SalePrice']
In [40]:
# call all statistic summary for ames numeric feature with high cardinality
for i in ames_numeric_high_car:
    print(numDEA(i,df=ameHouse),end="\n\n")
     Mean  Median  Mode     std  Min   Max  Range     5%     25%    50%  \
Id  730.5   730.5     1  421.61    1  1460   1459  73.95  365.75  730.5   

        75%     90%    IQR  Count  Unique  Missing Value  
Id  1095.25  1314.1  729.5   1460    1460              0  

              Mean  Median  Mode    std   Min    Max  Range    5%   25%   50%  \
LotFrontage  70.05    69.0  60.0  24.28  21.0  313.0  292.0  34.0  59.0  69.0   

              75%   90%   IQR  Count  Unique  Missing Value  
LotFrontage  80.0  96.0  21.0   1201     110            259  

             Mean  Median  Mode      std   Min     Max   Range      5%  \
LotArea  10516.83  9478.5  7200  9981.26  1300  215245  213945  3311.7   

            25%     50%      75%      90%     IQR  Count  Unique  \
LotArea  7553.5  9478.5  11601.5  14381.7  4048.0   1460    1073   

         Missing Value  
LotArea              0  

              Mean  Median  Mode   std   Min   Max  Range      5%     25%  \
YearBuilt  1971.27  1973.0  2006  30.2  1872  2010    138  1916.0  1954.0   

              50%     75%     90%   IQR  Count  Unique  Missing Value  
YearBuilt  1973.0  2000.0  2006.0  46.0   1460     112              0  

                 Mean  Median  Mode    std   Min   Max  Range      5%     25%  \
YearRemodAdd  1984.87  1994.0  1950  20.65  1950  2010     60  1950.0  1967.0   

                 50%     75%     90%   IQR  Count  Unique  Missing Value  
YearRemodAdd  1994.0  2004.0  2006.0  37.0   1460      61              0  

              Mean  Median  Mode     std  Min     Max   Range   5%  25%  50%  \
MasVnrArea  103.69     0.0   0.0  181.07  0.0  1600.0  1600.0  0.0  0.0  0.0   

              75%    90%    IQR  Count  Unique  Missing Value  
MasVnrArea  166.0  335.0  166.0   1452     327              8  

              Mean  Median  Mode    std  Min   Max  Range   5%  25%    50%  \
BsmtFinSF1  443.64   383.5     0  456.1    0  5644   5644  0.0  0.0  383.5   

               75%     90%     IQR  Count  Unique  Missing Value  
BsmtFinSF1  712.25  1065.5  712.25   1460     637              0  

             Mean  Median  Mode     std  Min   Max  Range   5%  25%  50%  75%  \
BsmtFinSF2  46.55     0.0     0  161.32    0  1474   1474  0.0  0.0  0.0  0.0   

              90%  IQR  Count  Unique  Missing Value  
BsmtFinSF2  117.2  0.0   1460     144              0  

             Mean  Median  Mode     std  Min   Max  Range   5%    25%    50%  \
BsmtUnfSF  567.24   477.5     0  441.87    0  2336   2336  0.0  223.0  477.5   

             75%     90%    IQR  Count  Unique  Missing Value  
BsmtUnfSF  808.0  1232.0  585.0   1460     780              0  

                Mean  Median  Mode     std  Min   Max  Range     5%     25%  \
TotalBsmtSF  1057.43   991.5     0  438.71    0  6110   6110  519.3  795.75   

               50%      75%     90%    IQR  Count  Unique  Missing Value  
TotalBsmtSF  991.5  1298.25  1602.2  502.5   1460     721              0  

             Mean  Median  Mode     std  Min   Max  Range      5%    25%  \
1stFlrSF  1162.63  1087.0   864  386.59  334  4692   4358  672.95  882.0   

             50%      75%     90%     IQR  Count  Unique  Missing Value  
1stFlrSF  1087.0  1391.25  1680.0  509.25   1460     753              0  

            Mean  Median  Mode     std  Min   Max  Range   5%  25%  50%  \
2ndFlrSF  346.99     0.0     0  436.53    0  2065   2065  0.0  0.0  0.0   

            75%    90%    IQR  Count  Unique  Missing Value  
2ndFlrSF  728.0  954.2  728.0   1460     417              0  

              Mean  Median  Mode    std  Min  Max  Range   5%  25%  50%  75%  \
LowQualFinSF  5.84     0.0     0  48.62    0  572    572  0.0  0.0  0.0  0.0   

              90%  IQR  Count  Unique  Missing Value  
LowQualFinSF  0.0  0.0   1460      24              0  

              Mean  Median  Mode     std  Min   Max  Range     5%     25%  \
GrLivArea  1515.46  1464.0   864  525.48  334  5642   5308  848.0  1129.5   

              50%      75%     90%     IQR  Count  Unique  Missing Value  
GrLivArea  1464.0  1776.75  2158.3  647.25   1460     861              0  

                Mean  Median    Mode    std     Min     Max  Range      5%  \
GarageYrBlt  1978.51  1980.0  2005.0  24.69  1900.0  2010.0  110.0  1930.0   

                25%     50%     75%     90%   IQR  Count  Unique  \
GarageYrBlt  1961.0  1980.0  2002.0  2006.0  41.0   1379      97   

             Missing Value  
GarageYrBlt             81  

              Mean  Median  Mode    std  Min   Max  Range   5%    25%    50%  \
GarageArea  472.98   480.0     0  213.8    0  1418   1418  0.0  334.5  480.0   

              75%    90%    IQR  Count  Unique  Missing Value  
GarageArea  576.0  757.1  241.5   1460     441              0  

             Mean  Median  Mode     std  Min  Max  Range   5%  25%  50%  \
WoodDeckSF  94.24     0.0     0  125.34    0  857    857  0.0  0.0  0.0   

              75%    90%    IQR  Count  Unique  Missing Value  
WoodDeckSF  168.0  262.0  168.0   1460     274              0  

              Mean  Median  Mode    std  Min  Max  Range   5%  25%   50%  \
OpenPorchSF  46.66    25.0     0  66.26    0  547    547  0.0  0.0  25.0   

              75%    90%   IQR  Count  Unique  Missing Value  
OpenPorchSF  68.0  130.0  68.0   1460     202              0  

                Mean  Median  Mode    std  Min  Max  Range   5%  25%  50%  \
EnclosedPorch  21.95     0.0     0  61.12    0  552    552  0.0  0.0  0.0   

               75%    90%  IQR  Count  Unique  Missing Value  
EnclosedPorch  0.0  112.0  0.0   1460     120              0  

           Mean  Median  Mode    std  Min  Max  Range   5%  25%  50%  75%  \
3SsnPorch  3.41     0.0     0  29.32    0  508    508  0.0  0.0  0.0  0.0   

           90%  IQR  Count  Unique  Missing Value  
3SsnPorch  0.0  0.0   1460      20              0  

              Mean  Median  Mode    std  Min  Max  Range   5%  25%  50%  75%  \
ScreenPorch  15.06     0.0     0  55.76    0  480    480  0.0  0.0  0.0  0.0   

             90%  IQR  Count  Unique  Missing Value  
ScreenPorch  0.0  0.0   1460      76              0  

          Mean  Median  Mode     std  Min    Max  Range   5%  25%  50%  75%  \
MiscVal  43.49     0.0     0  496.12    0  15500  15500  0.0  0.0  0.0  0.0   

         90%  IQR  Count  Unique  Missing Value  
MiscVal  0.0  0.0   1460      21              0  

               Mean    Median    Mode      std    Min     Max   Range  \
SalePrice  180921.2  163000.0  140000  79442.5  34900  755000  720100   

                5%       25%       50%       75%       90%      IQR  Count  \
SalePrice  88000.0  129975.0  163000.0  214000.0  278000.0  84025.0   1460   

           Unique  Missing Value  
SalePrice     663              0  

From this summary, i find that, even in this high cardinality features, there are still few with high zero value percentage, like 'MasVnrArea', 'BsmtFinSF2','2ndFlrSF','WoodDeckSF',and 'ScreenPorch'.

In [41]:
# Let's check if if these features still worth to keep
figMisc2, axMisc2 = plt.subplots(2,3, figsize=(15,10))
sns.regplot(ax=axMisc2[0,0],x='MasVnrArea',y="SalePrice",data=ameHouse)
sns.regplot(ax=axMisc2[0,1],x='BsmtFinSF2',y="SalePrice",data=ameHouse)
sns.regplot(ax=axMisc2[0,2],x='2ndFlrSF',y="SalePrice",data=ameHouse)
sns.regplot(ax=axMisc2[1,0],x='WoodDeckSF',y="SalePrice",data=ameHouse)
sns.regplot(ax=axMisc2[1,1],x='ScreenPorch',y="SalePrice",data=ameHouse)
plt.subplots_adjust(wspace=0.4)
plt.show()

Aside of 'MasVnrArea', all of these features doesn't seem worth to keep around. The rarity itself is a problem. Actually, looking back, all features with high zero percentage value is caused by rarity factor; it came back to the fact we found early about the existence of some elite house. It makes sense that not all house has pool, or wood deck,or 3 Season Porch, right?

In [42]:
# Keep all feature as numeric feature, except for MSSubClass
#ames_numeric = ames_numeric.drop('MSSubClass')
len(ames_numeric)
Out[42]:
36
In [43]:
# Analysis on Categorical Feature 
ames_cat = [col for col in ameHouse.columns if ameHouse[col].dtypes=='object']
print(len(ames_cat))
ameHouse[ames_cat].nunique()
38
Out[43]:
MSZoning          5
Street            2
LotShape          4
LandContour       4
Utilities         2
LotConfig         5
LandSlope         3
Neighborhood     25
Condition1        9
Condition2        8
BldgType          5
HouseStyle        8
RoofStyle         6
RoofMatl          8
Exterior1st      15
Exterior2nd      16
MasVnrType        4
ExterQual         4
ExterCond         5
Foundation        6
BsmtQual          4
BsmtCond          4
BsmtExposure      4
BsmtFinType1      6
BsmtFinType2      6
Heating           6
HeatingQC         5
CentralAir        2
Electrical        5
KitchenQual       4
Functional        7
GarageType        6
GarageFinish      3
GarageQual        5
GarageCond        5
PavedDrive        3
SaleType          9
SaleCondition     6
dtype: int64
In [44]:
# Check missing value
ameHouse[ames_cat].isnull().sum()
Out[44]:
MSZoning          0
Street            0
LotShape          0
LandContour       0
Utilities         0
LotConfig         0
LandSlope         0
Neighborhood      0
Condition1        0
Condition2        0
BldgType          0
HouseStyle        0
RoofStyle         0
RoofMatl          0
Exterior1st       0
Exterior2nd       0
MasVnrType        8
ExterQual         0
ExterCond         0
Foundation        0
BsmtQual         37
BsmtCond         37
BsmtExposure     38
BsmtFinType1     37
BsmtFinType2     38
Heating           0
HeatingQC         0
CentralAir        0
Electrical        1
KitchenQual       0
Functional        0
GarageType       81
GarageFinish     81
GarageQual       81
GarageCond       81
PavedDrive        0
SaleType          0
SaleCondition     0
dtype: int64

Here i have 38 features treated as categorical. In one glance, all missing value is caused by natural condition (no garage, no basement, etc). I build some visualization to make this rather presentable

In [45]:
figCat, axCat = plt.subplots(3,3, figsize=(15,15))
sns.countplot(ax=axCat[0,0],x='MSZoning',data=ameHouse)
sns.countplot(ax=axCat[0,1],x='Street',data=ameHouse)
sns.countplot(ax=axCat[0,2],x='LotShape',data=ameHouse)
sns.countplot(ax=axCat[1,0],x='LandContour',data=ameHouse)
sns.countplot(ax=axCat[1,1],x='Utilities',data=ameHouse)
sns.countplot(ax=axCat[1,2],x='LotConfig',data=ameHouse)
sns.countplot(ax=axCat[2,0],x='LandSlope',data=ameHouse)
sns.countplot(ax=axCat[2,1],x='Neighborhood',data=ameHouse)
sns.countplot(ax=axCat[2,2],x='Condition1',data=ameHouse)
plt.subplots_adjust(wspace=0.4)
plt.show()
In [46]:
figCat, axCat = plt.subplots(3,3, figsize=(15,15))
sns.boxplot(ax=axCat[0,0],x='MSZoning',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[0,1],x='Street',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[0,2],x='LotShape',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[1,0],x='LandContour',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[1,1],x='Utilities',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[1,2],x='LotConfig',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[2,0],x='LandSlope',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[2,1],x='Neighborhood',y="SalePrice",data=ameHouse)
sns.boxplot(ax=axCat[2,2],x='Condition1',y="SalePrice",data=ameHouse)
plt.subplots_adjust(wspace=0.4)
plt.show()

Judging from the distribution and count value per category, almost all of nine categoricals we visualized has unbalanced proportion, dominated by certain group.

In [47]:
### Preprocessing Data
# As we start to pre processing data, it's time to split the training dataset
# Import train_test_split module
from sklearn.model_selection import train_test_split
In [48]:
# separate target variable from the rest
# in this project, the target is SalePrice
X = ameHouse.copy(deep=True)
y = X.pop("SalePrice")
In [49]:
X.head()
Out[49]:
Id MSSubClass MSZoning LotFrontage LotArea Street LotShape LandContour Utilities LotConfig ... OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition
0 1 60 RL 65.0 8450 Pave Reg Lvl AllPub Inside ... 61 0 0 0 0 0 2 2008 WD Normal
1 2 20 RL 80.0 9600 Pave Reg Lvl AllPub FR2 ... 0 0 0 0 0 0 5 2007 WD Normal
2 3 60 RL 68.0 11250 Pave IR1 Lvl AllPub Inside ... 42 0 0 0 0 0 9 2008 WD Normal
3 4 70 RL 60.0 9550 Pave IR1 Lvl AllPub Corner ... 35 272 0 0 0 0 2 2006 WD Abnorml
4 5 60 RL 84.0 14260 Pave IR1 Lvl AllPub FR2 ... 84 0 0 0 0 0 12 2008 WD Normal

5 rows × 75 columns

In [50]:
y.head()
Out[50]:
0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64
In [51]:
train_X, val_X, train_y, val_y = train_test_split(X,y,train_size=0.8,test_size=0.2, random_state=0)
In [52]:
print(train_X.shape)
print(val_X.shape)
print(train_y.shape)
print(val_y.shape)
(1168, 75)
(292, 75)
(1168,)
(292,)

Handling Missing Value¶

As we know from previous DEA result, there are some features contain missing value. Some of them has been dropped because of their high percentage of missing value. What left is to deal with the remaining problem.

It's easy to get rid of entire rows of data with missing values, but I don't want to risk losing too much important information.

To do that, i will try to do some reasoning on each feature with missing values, to find out the cause of the missing values. If the cause is natural, then I will remove the row or replace it with a zero value. On the other hand, if the cause of the missing value is an error, then I will do imputation.

In [53]:
# Recall all the missing
miss_col = [col for col in train_X.columns if train_X[col].isnull().sum()>0]
print(len(miss_col))
train_X[miss_col].isnull().sum()
14
Out[53]:
LotFrontage     212
MasVnrType        6
MasVnrArea        6
BsmtQual         28
BsmtCond         28
BsmtExposure     28
BsmtFinType1     28
BsmtFinType2     29
Electrical        1
GarageType       58
GarageYrBlt      58
GarageFinish     58
GarageQual       58
GarageCond       58
dtype: int64
In [54]:
# There are 14 features contains missing value.
# Let's examine each of it
In [55]:
# LotFrontage, as stated before, is the linear feet of street connected to the property
# First, let's think it. Is it possible for a house, a residential house, to not has any street or alley in its front?
# While it still possible in a rural area, it seems not be the case in this dataset.
# To Prove it,there are few features we can use.

# I extract all index with null value of LotFrontage 
miss_lotFrontage = train_X[train_X['LotFrontage'].isnull()].index.tolist()

# Street == Type of road access to property
# This feature consist of two categories, ie 'Gravel' and 'Paved'
sns.countplot(x=train_X.loc[miss_lotFrontage,'Street'])
plt.title("Street Access of house with null value of LotFrontage", fontweight = "bold")
plt.show()
print("Missing value :{}".format(train_X['Street'].isnull().sum())) 
# no missing value, means all houses in our train dataset have access to the road`
# Most of house with null LotFrontage has Pave road connected to it, so it is make no sense. 
Missing value :0
In [56]:
# Now i know that missing value in LotFrontage is caused by error factor. The next step is to do some imputation to fill these
# empty value
# For this case, i want to use median value of LotFrontage based on its MSZoning and MSSubClass. This abritrary decision.
median_lotFrontage_by_ZoningClass = train_X.groupby(['MSZoning','MSSubClass'])['LotFrontage'].median()
train_X.groupby(['MSZoning','MSSubClass'])['LotFrontage'].median()

# Start imputation
Out[56]:
MSZoning  MSSubClass
C (all)   20            58.0
          30            90.0
          50            66.0
          70            50.0
          190           60.0
FV        20            73.5
          60            75.0
          120           50.0
          160           30.0
RH        20            60.0
          30            70.0
          45            60.0
          50            55.0
          70            54.5
          90            82.0
          120           34.0
          190           60.0
RL        20            75.0
          30            60.0
          40            60.0
          45            57.0
          50            60.0
          60            79.0
          70            66.0
          75            60.0
          80            80.0
          85            73.0
          90            71.0
          120           44.0
          160           36.0
          190           71.5
RM        20            70.0
          30            56.0
          40            40.0
          45            53.0
          50            52.0
          60            60.0
          70            60.0
          75            70.0
          90            68.0
          120           40.5
          160           21.0
          180           21.0
          190           60.0
Name: LotFrontage, dtype: float64
In [57]:
# Imputation
# Python may give warning like this: 
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
# Disable this by write this code instead
# pd.options.mode.chained_assignment = None  # default='warn'
#.loc[row_index,col_indexer] = value instead
pd.options.mode.chained_assignment = None
miss_lotFrontage = train_X[train_X['LotFrontage'].isnull()].index.tolist()
for j in miss_lotFrontage:
    for i in range(len(median_lotFrontage_by_ZoningClass)):
        if(train_X.loc[j,'MSZoning'] == median_lotFrontage_by_ZoningClass.index[i][0] and train_X.loc[j,'MSSubClass']==median_lotFrontage_by_ZoningClass.index[i][1]):
            train_X.loc[j,'LotFrontage'] = median_lotFrontage_by_ZoningClass[i]
In [58]:
# Imputation Success
train_X.loc[miss_lotFrontage,'LotFrontage']
Out[58]:
817     75.0
1164    80.0
186     80.0
1441    40.5
465     40.5
        ... 
845     73.0
537     75.0
1033    75.0
1383    60.0
559     44.0
Name: LotFrontage, Length: 212, dtype: float64
In [59]:
train_X['LotFrontage'].isnull().sum() # no more missing value
Out[59]:
0
In [60]:
# Don't forget to do the same thing to validation and test set
print(val_X['LotFrontage'].isnull().sum()) # 47 missing values in validation dataset
print(ameHouse_test['LotFrontage'].isnull().sum()) # 227 missing values in test dataset
47
227
In [61]:
Valid_median_lotFrontage_by_ZoningClass = val_X.groupby(['MSZoning','MSSubClass'])['LotFrontage'].median()
Test_median_lotFrontage_by_ZoningClass = ameHouse_test.groupby(['MSZoning','MSSubClass'])['LotFrontage'].median()
In [62]:
Valid_median_lotFrontage_by_ZoningClass
# There is a null value of median in validation dataset. I replace it with the median from the same grouping in test dataset
Valid_median_lotFrontage_by_ZoningClass[5] = 60
Valid_median_lotFrontage_by_ZoningClass
Out[62]:
MSZoning  MSSubClass
C (all)   50            60.0
FV        20            78.0
          60            75.0
          120           41.0
          160           34.0
RH        190           60.0
RL        20            71.0
          30            47.5
          50            79.0
          60            77.5
          70            60.0
          80            72.5
          85            68.0
          90            64.5
          120           47.0
          160           24.0
          190           60.0
RM        20            50.0
          30            60.0
          40            50.0
          45            50.0
          50            55.5
          70            60.0
          75            76.0
          90            78.0
          120           48.0
          160           21.0
          180           21.0
          190           65.0
Name: LotFrontage, dtype: float64
In [63]:
# Apparently, test data set has different columns set with missing value compared to valid and training data set. 
# I put it aside for now, and will deal with it in separate section.
Test_median_lotFrontage_by_ZoningClass
# ameHouse_test.isnull().sum() #uncomment this line to show which columns has missing value in test dataset 
Out[63]:
MSZoning  MSSubClass
C (all)   20            65.0
          30            66.0
          50            60.0
          70            60.0
          190           62.5
FV        20            75.0
          60            80.0
          120           36.0
          160           24.0
RH        20            80.0
          30            70.0
          50            70.0
          90            60.0
          120           26.0
          190           59.0
RL        20            75.0
          30            60.0
          40            58.5
          45            74.5
          50            60.0
          60            75.0
          70            61.5
          75            72.0
          80            74.0
          85            72.0
          90            70.0
          120           45.0
          150            NaN
          160           24.0
          190           70.0
RM        20            63.5
          30            56.0
          45            50.0
          50            56.0
          60            58.5
          70            60.0
          75            65.0
          80            76.0
          85            86.0
          90            68.0
          120           42.5
          160           21.0
          180           28.0
          190           60.0
Name: LotFrontage, dtype: float64
In [64]:
# get all row of valid dataset with missing lotFrontage value
Valid_miss_lotFrontage = val_X[val_X['LotFrontage'].isnull()].index.tolist()
# Imputed  the value with median
for j in Valid_miss_lotFrontage:
    for i in range(len(Valid_median_lotFrontage_by_ZoningClass)):
        if(val_X.loc[j,'MSZoning'] == Valid_median_lotFrontage_by_ZoningClass.index[i][0] and val_X.loc[j,'MSSubClass']==Valid_median_lotFrontage_by_ZoningClass.index[i][1]):
            val_X.loc[j,'LotFrontage'] = Valid_median_lotFrontage_by_ZoningClass[i]
In [65]:
# Check the imputed value
val_X['LotFrontage'].isnull().sum() # No More Missing Value
Out[65]:
0
In [66]:
# MasVnrType and MasVnrArea Imputation
# The reason i group these two together is because they are represents the same thing, that is Masonry Venree.
# Based on data documentation, a house without Masonry Venree valued as None.
# Logically, there will be no Masonry area if no Masonry Venree is build
# and that's why MasVnrType and MasVnrArea has the same number of missing value
# To impute this features, i will replace None value with string "None" for MasVnrType and 0 value for MasVnrArea
miss_MasVnr = train_X[train_X['MasVnrType'].isnull()].index.tolist()
# Imputing MasVnrType
train_X.loc[miss_MasVnr, 'MasVnrType'] = "None"
# Imputing MasVnrArea
train_X.loc[miss_MasVnr, 'MasVnrArea'] = 0.0
In [67]:
# check
print(train_X['MasVnrType'].isnull().sum()) # no missing value
print(train_X['MasVnrArea'].isnull().sum()) # no missing value
0
0
In [68]:
# Don't forget to do the same thing to validation and test dataset
# Check the missing value 
print("Missing value of MasVnrType in val_X: {}".format(val_X['MasVnrType'].isnull().sum()))
print("Missing value of MasVnrArea in val_X: {}".format(val_X['MasVnrArea'].isnull().sum()))
print("Missing value of MasVnrType in ameHouse_test: {}".format(ameHouse_test['MasVnrType'].isnull().sum()))
print("Missing value of MasVnrArea in ameHouse_test: {}".format(ameHouse_test['MasVnrArea'].isnull().sum()))
Missing value of MasVnrType in val_X: 2
Missing value of MasVnrArea in val_X: 2
Missing value of MasVnrType in ameHouse_test: 16
Missing value of MasVnrArea in ameHouse_test: 15
In [69]:
# get the index of missing value in validation dataset
Val_miss_MasVnr = val_X[val_X['MasVnrType'].isnull()].index.tolist()
# Start imputing
val_X.loc[Val_miss_MasVnr,'MasVnrType'] = "None"
val_X.loc[Val_miss_MasVnr,'MasVnrArea'] = 0.0
In [70]:
# Check missing
print("Missing value of MasVnrType in val_X: {}".format(val_X['MasVnrType'].isnull().sum()))
print("Missing value of MasVnrArea in val_X: {}".format(val_X['MasVnrArea'].isnull().sum()))
Missing value of MasVnrType in val_X: 0
Missing value of MasVnrArea in val_X: 0
In [71]:
# Imputing test dataset MassVnr
Test_miss_MasVnr = ameHouse_test[ameHouse_test['MasVnrType'].isnull()].index.tolist()
ameHouse_test.loc[Test_miss_MasVnr, 'MasVnrType'] = "None"
ameHouse_test.loc[Test_miss_MasVnr, 'MasVnrArea'] = 0.0
print("Missing value of MasVnrType in ameHouse_test: {}".format(ameHouse_test['MasVnrType'].isnull().sum()))
print("Missing value of MasVnrArea in ameHouse_test: {}".format(ameHouse_test['MasVnrArea'].isnull().sum()))
Missing value of MasVnrType in ameHouse_test: 0
Missing value of MasVnrArea in ameHouse_test: 0
In [72]:
#  Imputing BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2
# Same as MasVnr... , These 4 features represents basement. House without basement  will valued as NA
# Because all of them are categorical features, i will replace it with string 'NoBasement'
Train_miss_Bsmt=train_X[train_X['BsmtQual'].isnull()][['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']].index.tolist()
train_X.loc[Train_miss_Bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]='NoBasement'
In [73]:
print("Missing value of BsmtQual:{}".format(train_X['BsmtQual'].isnull().sum()))
print("Missing value of BsmtCond:{}".format(train_X['BsmtCond'].isnull().sum()))
print("Missing value of BsmtExposure:{}".format(train_X['BsmtExposure'].isnull().sum()))
print("Missing value of BsmtFinType1:{}".format(train_X['BsmtFinType1'].isnull().sum()))
print("Missing value of BsmtFinType2:{}".format(train_X['BsmtFinType2'].isnull().sum()))
Missing value of BsmtQual:0
Missing value of BsmtCond:0
Missing value of BsmtExposure:0
Missing value of BsmtFinType1:0
Missing value of BsmtFinType2:1
In [74]:
train_X[train_X['BsmtFinType2'].isnull()][['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','BsmtFinSF2']]
Out[74]:
BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 BsmtFinSF2
332 Gd TA No GLQ NaN 479
In [75]:
# i use this proportion of finished basement type 2 for imputing the only remaining missing value of basement type 2
train_X.loc[train_X['BsmtFinType2']!="Unf",'BsmtFinType2'].value_counts()
Out[75]:
LwQ           42
Rec           39
BLQ           30
NoBasement    28
ALQ           14
GLQ           11
Name: BsmtFinType2, dtype: int64
In [76]:
train_X.loc[332, 'BsmtFinType2'] = "LwQ"
print("Missing value of BsmtFinType2:{}".format(train_X['BsmtFinType2'].isnull().sum()))
Missing value of BsmtFinType2:0
In [77]:
print(val_X[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum(),end="\n\n")
print(ameHouse_test[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum())
BsmtQual         9
BsmtCond         9
BsmtExposure    10
BsmtFinType1     9
BsmtFinType2     9
dtype: int64

BsmtQual        44
BsmtCond        45
BsmtExposure    44
BsmtFinType1    42
BsmtFinType2    42
dtype: int64
In [78]:
val_miss_bsmt = val_X[val_X['BsmtExposure'].isnull()].index.tolist()
val_X.loc[val_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]
Out[78]:
BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
1049 NaN NaN NaN NaN NaN
1000 NaN NaN NaN NaN NaN
362 NaN NaN NaN NaN NaN
182 NaN NaN NaN NaN NaN
897 NaN NaN NaN NaN NaN
1090 NaN NaN NaN NaN NaN
342 NaN NaN NaN NaN NaN
39 NaN NaN NaN NaN NaN
948 Gd TA NaN Unf Unf
520 NaN NaN NaN NaN NaN
In [79]:
sns.countplot(x='BsmtQual',hue='BsmtExposure',data=val_X)
plt.show()
In [80]:
# Based on the chart above, it seems that basement with Good Quality(90-99 inch) has higher propotion for basement with
# Good Exposure. It may come from the fact that higher basement able to get optimum sunlight and air circulation exposure.
# I use this as my imputation basis for row number 984

val_X.loc[948,'BsmtExposure'] = 'Gd'
In [81]:
# Let's update the missing value index
val_miss_bsmt = val_X[val_X['BsmtExposure'].isnull()].index.tolist()
val_miss_bsmt
# Imputed the remaining rows
val_X.loc[val_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]='NoBasement'
In [82]:
# Check the value
val_X.loc[val_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]
val_X[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum() # Clear
Out[82]:
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
dtype: int64
In [83]:
# Basement in Test data set 
Test_miss_bsmt = ameHouse_test.loc[ameHouse_test['BsmtFinType1'].isnull()].index.tolist()
ameHouse_test.loc[Test_miss_bsmt, ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]
Out[83]:
BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
125 NaN NaN NaN NaN NaN
133 NaN NaN NaN NaN NaN
269 NaN NaN NaN NaN NaN
318 NaN NaN NaN NaN NaN
354 NaN NaN NaN NaN NaN
387 NaN NaN NaN NaN NaN
388 NaN NaN NaN NaN NaN
396 NaN NaN NaN NaN NaN
397 NaN NaN NaN NaN NaN
398 NaN NaN NaN NaN NaN
400 NaN NaN NaN NaN NaN
455 NaN NaN NaN NaN NaN
590 NaN NaN NaN NaN NaN
606 NaN NaN NaN NaN NaN
608 NaN NaN NaN NaN NaN
660 NaN NaN NaN NaN NaN
662 NaN NaN NaN NaN NaN
728 NaN NaN NaN NaN NaN
729 NaN NaN NaN NaN NaN
730 NaN NaN NaN NaN NaN
733 NaN NaN NaN NaN NaN
756 NaN NaN NaN NaN NaN
764 NaN NaN NaN NaN NaN
927 NaN NaN NaN NaN NaN
975 NaN NaN NaN NaN NaN
992 NaN NaN NaN NaN NaN
993 NaN NaN NaN NaN NaN
1030 NaN NaN NaN NaN NaN
1038 NaN NaN NaN NaN NaN
1087 NaN NaN NaN NaN NaN
1092 NaN NaN NaN NaN NaN
1104 NaN NaN NaN NaN NaN
1118 NaN NaN NaN NaN NaN
1139 NaN NaN NaN NaN NaN
1242 NaN NaN NaN NaN NaN
1303 NaN NaN NaN NaN NaN
1306 NaN NaN NaN NaN NaN
1343 NaN NaN NaN NaN NaN
1344 NaN NaN NaN NaN NaN
1364 NaN NaN NaN NaN NaN
1431 NaN NaN NaN NaN NaN
1444 NaN NaN NaN NaN NaN
In [84]:
# Impute this rows with the same method as Train and Valid dataset
ameHouse_test.loc[Test_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']] = 'NoBasement'
ameHouse_test[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum()
Out[84]:
BsmtQual        2
BsmtCond        3
BsmtExposure    2
BsmtFinType1    0
BsmtFinType2    0
dtype: int64
In [85]:
Test2_miss_bsmt = ameHouse_test[ameHouse_test['BsmtExposure'].isnull()].index.tolist()
In [86]:
Test_inquiry_bsmt = ameHouse_test.loc[(ameHouse_test['BsmtQual']=='Gd') & (ameHouse_test['BsmtCond']=='TA') & (ameHouse_test['BsmtFinType1']=='Unf') 
                 & (ameHouse_test['BsmtFinType2']=='Unf')].index.tolist()
ameHouse_test.loc[Test_inquiry_bsmt,'BsmtExposure'].value_counts()
# We mining all rows from test dataset that has the same condition 
# Start from this, i will impute the missing BsmtExposure with 'No'
Out[86]:
No    119
Av     32
Mn     12
Gd      5
Name: BsmtExposure, dtype: int64
In [87]:
ameHouse_test.loc[Test2_miss_bsmt,'BsmtExposure'] = 'No'
ameHouse_test[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum()
Out[87]:
BsmtQual        2
BsmtCond        3
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
dtype: int64
In [88]:
# imputing the BsmtQual in test dataset
Test3_miss_bsmt = ameHouse_test.loc[ameHouse_test['BsmtQual'].isnull()].index.tolist()
ameHouse_test.loc[Test3_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]
Out[88]:
BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
757 NaN Fa No Unf Unf
758 NaN TA No Unf Unf
In [89]:
Test4_miss_bsmt = ameHouse_test.loc[(ameHouse_test['BsmtCond']=='Fa') | (ameHouse_test['BsmtCond']=='TA') & (ameHouse_test['BsmtExposure']=="No") &
                                   (ameHouse_test['BsmtFinType1']=="Unf") & (ameHouse_test['BsmtFinType2']=='Unf')].index.tolist()
ameHouse_test.loc[Test4_miss_bsmt,'BsmtQual'].value_counts()
Out[89]:
TA    193
Gd    128
Fa     35
Ex      9
Name: BsmtQual, dtype: int64
In [90]:
# Impute BsmtQual in test dataset with 'TA'

ameHouse_test.loc[Test4_miss_bsmt,'BsmtQual'] = 'TA'
ameHouse_test[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum()
Out[90]:
BsmtQual        0
BsmtCond        3
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
dtype: int64
In [91]:
# Imputing BsmtCond in Test dataset
Test5_miss_bsmt = ameHouse_test.loc[ameHouse_test['BsmtCond'].isnull()].index.tolist()
ameHouse_test.loc[Test5_miss_bsmt,['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]
Out[91]:
BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
580 Gd NaN Mn GLQ Rec
725 TA NaN No BLQ Unf
1064 TA NaN Av ALQ Unf
In [92]:
# I use median for this SimpleImputer strategy that is most_frequent
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
# define the imputer
simpleImp= SimpleImputer(missing_values = np.nan, strategy='most_frequent')
# Define the ColumnTransformer
column_trans = ColumnTransformer([('imputed_BsmtCond',simpleImp,[30])])
imputed_ameHouse_test = column_trans.fit_transform(ameHouse_test)

#new data
imputed_ameHouse_test
Out[92]:
array([['TA'],
       ['TA'],
       ['TA'],
       ...,
       ['TA'],
       ['TA'],
       ['TA']], dtype=object)
In [93]:
ameHouse_test['imputed_BsmtCond'] = imputed_ameHouse_test
ameHouse_test.columns
ameHouse_test.drop('BsmtCond',axis=1,inplace=True)
ameHouse_test.shape
Out[93]:
(1459, 75)
In [94]:
ameHouse_test['imputed_BsmtCond']
Out[94]:
0       TA
1       TA
2       TA
3       TA
4       TA
        ..
1454    TA
1455    TA
1456    TA
1457    TA
1458    TA
Name: imputed_BsmtCond, Length: 1459, dtype: object
In [95]:
ameHouse_test.isnull().sum()
Out[95]:
Id                    0
MSSubClass            0
MSZoning              4
LotFrontage         227
LotArea               0
                   ... 
MoSold                0
YrSold                0
SaleType              1
SaleCondition         0
imputed_BsmtCond      0
Length: 75, dtype: int64
In [96]:
# Electrical Missing Value
miss_elec = train_X.loc[train_X['Electrical'].isnull()].index
miss_elec
Out[96]:
Int64Index([1379], dtype='int64')
In [97]:
figElec,elc = plt.subplots(1,2, figsize=(15,5))
sns.countplot(ax = elc[0],x='Electrical',hue="MSZoning",data=train_X).set_title('Electrical System by MSZoning')
elc[0].legend(loc="upper right",title='MSZoning')
sns.countplot(ax=elc[1],x='Electrical',hue='MSSubClass',data=train_X).set_title('Electrical System by MSSubClass')
elc[1].legend(loc="upper right",title='MSSubClass')
plt.show()
In [98]:
# From these two visualization, we acquire insight about which electrical system used by house from across certain category most
train_X.loc[miss_elec,['MSZoning','MSSubClass']]
# Impute the missing value with SBrkr
train_X.loc[miss_elec,'Electrical'] = 'SBrkr'
print("Missing value of Electrical: {} ".format(train_X['Electrical'].isnull().sum()))
Missing value of Electrical: 0 
In [99]:
# Let's do the same for validation and test dataset
print("Missing value of Electrical in validation dataset : {}".format(val_X['Electrical'].isnull().sum()))
print("Missing value of Electrical in test dataset : {}".format(ameHouse_test['Electrical'].isnull().sum()))
Missing value of Electrical in validation dataset : 0
Missing value of Electrical in test dataset : 0
In [100]:
#At first glance, the Garage feature has the same imputation properties as basement and Masonry. But look in more detail, 
#and we realize that there is one feature called GarageYrBlt that cannot be imputed arbitrarily. Logically, when there is 
#no garage, there is no development. Imputing the GarageYrBlt variable with various techniques, as far as I know, 
# will only make this feature lose its essence. Therefore, I decided to ditch the row where these features miss.
 
miss_garage = train_X.loc[train_X['GarageType'].isnull()].index.tolist()
train_X.loc[miss_garage,['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','GarageCars']]
Out[100]:
GarageType GarageYrBlt GarageFinish GarageQual GarageCond GarageCars
431 NaN NaN NaN NaN NaN 0
638 NaN NaN NaN NaN NaN 0
1337 NaN NaN NaN NaN NaN 0
528 NaN NaN NaN NaN NaN 0
960 NaN NaN NaN NaN NaN 0
1011 NaN NaN NaN NaN NaN 0
1283 NaN NaN NaN NaN NaN 0
1038 NaN NaN NaN NaN NaN 0
165 NaN NaN NaN NaN NaN 0
1349 NaN NaN NaN NaN NaN 0
533 NaN NaN NaN NaN NaN 0
386 NaN NaN NaN NaN NaN 0
88 NaN NaN NaN NaN NaN 0
1453 NaN NaN NaN NaN NaN 0
1449 NaN NaN NaN NaN NaN 0
784 NaN NaN NaN NaN NaN 0
495 NaN NaN NaN NaN NaN 0
942 NaN NaN NaN NaN NaN 0
127 NaN NaN NaN NaN NaN 0
441 NaN NaN NaN NaN NaN 0
210 NaN NaN NaN NaN NaN 0
970 NaN NaN NaN NaN NaN 0
48 NaN NaN NaN NaN NaN 0
636 NaN NaN NaN NaN NaN 0
1123 NaN NaN NaN NaN NaN 0
1234 NaN NaN NaN NaN NaN 0
78 NaN NaN NaN NaN NaN 0
250 NaN NaN NaN NaN NaN 0
614 NaN NaN NaN NaN NaN 0
968 NaN NaN NaN NaN NaN 0
582 NaN NaN NaN NaN NaN 0
155 NaN NaN NaN NaN NaN 0
125 NaN NaN NaN NaN NaN 0
1131 NaN NaN NaN NaN NaN 0
1137 NaN NaN NaN NaN NaN 0
375 NaN NaN NaN NaN NaN 0
1219 NaN NaN NaN NaN NaN 0
635 NaN NaN NaN NaN NaN 0
738 NaN NaN NaN NaN NaN 0
1009 NaN NaN NaN NaN NaN 0
1179 NaN NaN NaN NaN NaN 0
1325 NaN NaN NaN NaN NaN 0
163 NaN NaN NaN NaN NaN 0
1407 NaN NaN NaN NaN NaN 0
750 NaN NaN NaN NaN NaN 0
1218 NaN NaN NaN NaN NaN 0
710 NaN NaN NaN NaN NaN 0
1326 NaN NaN NaN NaN NaN 0
287 NaN NaN NaN NaN NaN 0
843 NaN NaN NaN NaN NaN 0
307 NaN NaN NaN NaN NaN 0
562 NaN NaN NaN NaN NaN 0
954 NaN NaN NaN NaN NaN 0
1143 NaN NaN NaN NaN NaN 0
291 NaN NaN NaN NaN NaN 0
99 NaN NaN NaN NaN NaN 0
976 NaN NaN NaN NaN NaN 0
705 NaN NaN NaN NaN NaN 0
In [101]:
train_X.drop(index=miss_garage, inplace=True)
train_X[['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','GarageCars']].isnull().sum()
Out[101]:
GarageType      0
GarageYrBlt     0
GarageFinish    0
GarageQual      0
GarageCond      0
GarageCars      0
dtype: int64
In [102]:
# Check all features
train_X[miss_col].isnull().sum()
# Check the data dimension
train_X.shape # From 1168 to 1110, i lost 58 rows, but fortunately, still retain 95% data
Out[102]:
(1110, 75)
In [103]:
# Garage Features in validation and test dataset
print(val_X[['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond']].isnull().sum(),end="\n\n")
print(ameHouse_test[['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond']].isnull().sum())
GarageType      23
GarageYrBlt     23
GarageFinish    23
GarageQual      23
GarageCond      23
dtype: int64

GarageType      76
GarageYrBlt     78
GarageFinish    78
GarageQual      78
GarageCond      78
dtype: int64
In [104]:
Valid_miss_garage = val_X.loc[val_X['GarageType'].isnull()].index.tolist()
val_X.loc[Valid_miss_garage,['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond']]
Out[104]:
GarageType GarageYrBlt GarageFinish GarageQual GarageCond
89 NaN NaN NaN NaN NaN
649 NaN NaN NaN NaN NaN
393 NaN NaN NaN NaN NaN
535 NaN NaN NaN NaN NaN
1173 NaN NaN NaN NaN NaN
1096 NaN NaN NaN NaN NaN
148 NaN NaN NaN NaN NaN
921 NaN NaN NaN NaN NaN
613 NaN NaN NaN NaN NaN
108 NaN NaN NaN NaN NaN
464 NaN NaN NaN NaN NaN
1030 NaN NaN NaN NaN NaN
620 NaN NaN NaN NaN NaN
826 NaN NaN NaN NaN NaN
434 NaN NaN NaN NaN NaN
140 NaN NaN NaN NaN NaN
1450 NaN NaN NaN NaN NaN
39 NaN NaN NaN NaN NaN
1257 NaN NaN NaN NaN NaN
241 NaN NaN NaN NaN NaN
520 NaN NaN NaN NaN NaN
198 NaN NaN NaN NaN NaN
1323 NaN NaN NaN NaN NaN
In [105]:
# drop all Garage Features with missing value
val_X.drop(index=Valid_miss_garage, inplace=True)
val_X[['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','GarageCars']].isnull().sum()
Out[105]:
GarageType      0
GarageYrBlt     0
GarageFinish    0
GarageQual      0
GarageCond      0
GarageCars      0
dtype: int64
In [106]:
val_X.shape # I able to keep 92,12% of the data after drop  imputation
Out[106]:
(269, 75)
In [107]:
Test_miss_garage = ameHouse_test.loc[ameHouse_test['GarageFinish'].isnull()].index.tolist()
print(ameHouse_test.loc[Test_miss_garage,['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond']])
     GarageType  GarageYrBlt GarageFinish GarageQual GarageCond
53          NaN          NaN          NaN        NaN        NaN
71          NaN          NaN          NaN        NaN        NaN
79          NaN          NaN          NaN        NaN        NaN
92          NaN          NaN          NaN        NaN        NaN
96          NaN          NaN          NaN        NaN        NaN
...         ...          ...          ...        ...        ...
1433        NaN          NaN          NaN        NaN        NaN
1449        NaN          NaN          NaN        NaN        NaN
1453        NaN          NaN          NaN        NaN        NaN
1454        NaN          NaN          NaN        NaN        NaN
1457        NaN          NaN          NaN        NaN        NaN

[78 rows x 5 columns]
In [108]:
ameHouse_test.loc[(ameHouse_test['GarageType'].notnull()) & (ameHouse_test['GarageFinish'].isnull())]
ameHouse_test.loc[[666,1116],['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond']]
Out[108]:
GarageType GarageYrBlt GarageFinish GarageQual GarageCond
666 Detchd NaN NaN NaN NaN
1116 Detchd NaN NaN NaN NaN
In [109]:
# There is a litle confusion from test dataset, but i consider that losing 78 rows or trying to imputate two rows is clear 
# decision line. I decide to drop all 78 rows, because it's simpler and the cost are relatively low.
#ameHouse_test.drop(index=Test_miss_garage, inplace=True)
ameHouse_test[['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','GarageCars']].isnull().sum()
Out[109]:
GarageType      76
GarageYrBlt     78
GarageFinish    78
GarageQual      78
GarageCond      78
GarageCars       1
dtype: int64
In [110]:
ameHouse_test.shape # i still have 94,65 of the original dataset
Out[110]:
(1459, 75)
In [111]:
# Check for missing values in train and validation data
columns_miss_train = len([col for col in train_X.columns if train_X[col].isnull().sum()>0])
columns_miss_val = len([col for col in val_X.columns if val_X[col].isnull().sum()>0])
columns_miss_test = len([col for col in ameHouse_test.columns if ameHouse_test[col].isnull().sum()>0])

print(columns_miss_train)
print(columns_miss_val)
print(columns_miss_test)
0
0
21
In [112]:
# Check for miss value columns in test dataset
miss_test_col = [col for col in ameHouse_test.columns if ameHouse_test[col].isnull().sum()>0]
ameHouse_test[miss_test_col].isnull().sum()
Out[112]:
MSZoning          4
LotFrontage     227
Utilities         2
Exterior1st       1
Exterior2nd       1
BsmtFinSF1        1
BsmtFinSF2        1
BsmtUnfSF         1
TotalBsmtSF       1
BsmtFullBath      2
BsmtHalfBath      2
KitchenQual       1
Functional        2
GarageType       76
GarageYrBlt      78
GarageFinish     78
GarageCars        1
GarageArea        1
GarageQual       78
GarageCond       78
SaleType          1
dtype: int64
In [113]:
# For efficiency, i decide to use KNN Imputation technique for numerical features and simpleimputer "most_frequent" for string
from sklearn.impute import KNNImputer
KnnImp = KNNImputer(n_neighbors = 6)
In [114]:
ames_test_cat =ameHouse_test.select_dtypes("object").columns.tolist()
ames_test_num =ameHouse_test.select_dtypes(["int64","float64"]).columns.tolist()
In [115]:
# ColumnTransformer for two Columns
catTestImputer = SimpleImputer(strategy = 'most_frequent')
test_transformer = ColumnTransformer([('imputed_cat',catTestImputer,ames_test_cat),
                                     ('imputed_num',KnnImp,ames_test_num)])
test_imputed_ames = test_transformer.fit_transform(ameHouse_test)
In [116]:
new_test_col = ameHouse_test.columns
new_test_col
Out[116]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtExposure',
       'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
       '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
       'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'imputed_BsmtCond'],
      dtype='object')
In [117]:
imputed_ames_test_new = pd.DataFrame(test_imputed_ames, columns=new_test_col)
In [118]:
new_col =ames_test_cat + ames_test_num
imputed_ames_test_new = pd.DataFrame(test_imputed_ames, columns=new_col).reset_index()
imputed_ames_test_new.drop('Id',axis=1,inplace=True)
imputed_ames_test_new
Out[118]:
index MSZoning Street LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 ... GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
0 0 RH Pave Reg Lvl AllPub Inside Gtl NAmes Feedr ... 730.0 140.0 0.0 0.0 0.0 120.0 0.0 0.0 6.0 2010.0
1 1 RL Pave IR1 Lvl AllPub Corner Gtl NAmes Norm ... 312.0 393.0 36.0 0.0 0.0 0.0 0.0 12500.0 6.0 2010.0
2 2 RL Pave IR1 Lvl AllPub Inside Gtl Gilbert Norm ... 482.0 212.0 34.0 0.0 0.0 0.0 0.0 0.0 3.0 2010.0
3 3 RL Pave IR1 Lvl AllPub Inside Gtl Gilbert Norm ... 470.0 360.0 36.0 0.0 0.0 0.0 0.0 0.0 6.0 2010.0
4 4 RL Pave IR1 HLS AllPub Inside Gtl StoneBr Norm ... 506.0 0.0 82.0 0.0 0.0 144.0 0.0 0.0 1.0 2010.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 1454 RM Pave Reg Lvl AllPub Inside Gtl MeadowV Norm ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 2006.0
1455 1455 RM Pave Reg Lvl AllPub Inside Gtl MeadowV Norm ... 286.0 0.0 24.0 0.0 0.0 0.0 0.0 0.0 4.0 2006.0
1456 1456 RL Pave Reg Lvl AllPub Inside Gtl Mitchel Norm ... 576.0 474.0 0.0 0.0 0.0 0.0 0.0 0.0 9.0 2006.0
1457 1457 RL Pave Reg Lvl AllPub Inside Gtl Mitchel Norm ... 0.0 80.0 32.0 0.0 0.0 0.0 0.0 700.0 7.0 2006.0
1458 1458 RL Pave Reg Lvl AllPub Inside Mod Mitchel Norm ... 650.0 190.0 48.0 0.0 0.0 0.0 0.0 0.0 11.0 2006.0

1459 rows × 75 columns

In [119]:
# Check for missing value
print("Missing value in test dataset: {}".format(len([col for col in imputed_ames_test_new.columns if imputed_ames_test_new[col].isnull().sum()>0])))
print("Missing value in train dataset: {}".format(len([col for col in train_X.columns if train_X[col].isnull().sum()>0])))
print("Missing value in validation dataset: {}".format(len([col for col in val_X.columns if val_X[col].isnull().sum()])))
Missing value in test dataset: 0
Missing value in train dataset: 0
Missing value in validation dataset: 0

Next step¶

Now that we are done with missing value in all dataset, we proceed to the next steps. That is, another cleaning. Here i explain what i would do for the next few steps.

  • Dealing with categorical features
  • Dealing with outliers and possibility of scaling
  • Parsing
  • Check for inconsistenty

After that, i would continue with FEATURE ENGINEERING, MODEL BUILDING, MODEL ENSEMBLING, and finally, draw one or some conclusions from the data.

Dealing with Categorical and Ordinal Features¶

Typically, there are two method to work with categorical/ordinal features (3 actually, but i ditch it because it just about drop non-numerical features), i.e:

  • Ordinal encoding
  • One-Hot encoding

The difference between the two lies in the output of the data encoding. Ordinal encoding will convert category data into ordinal data, while one-hot encoding will convert category data into nominal data, where each category will have its own column.

In [120]:
print("Object-type data column in train_X :{}".format(len(train_X.select_dtypes('object').columns))) # 38
print(train_X.select_dtypes('object').nunique()) 
ames_cat_col = train_X.select_dtypes('object').columns.tolist()
Object-type data column in train_X :38
MSZoning          5
Street            2
LotShape          4
LandContour       4
Utilities         2
LotConfig         5
LandSlope         3
Neighborhood     25
Condition1        9
Condition2        6
BldgType          5
HouseStyle        8
RoofStyle         6
RoofMatl          7
Exterior1st      14
Exterior2nd      15
MasVnrType        4
ExterQual         4
ExterCond         4
Foundation        6
BsmtQual          5
BsmtCond          5
BsmtExposure      5
BsmtFinType1      7
BsmtFinType2      7
Heating           6
HeatingQC         5
CentralAir        2
Electrical        5
KitchenQual       4
Functional        6
GarageType        6
GarageFinish      3
GarageQual        5
GarageCond        5
PavedDrive        3
SaleType          8
SaleCondition     6
dtype: int64
In [121]:
# Another thing we need to know especially for the Ames, Iowa housing data, is that there are numerical features that we can 
# treat as categorical data, or specifically as ordinal data type.
# To choose the right features that we would treat as ordinal, first, we filter the one with low cardinality, 
# and second, we read the data documentation to make sure that they are indeed have orndinal nature.
low_num_card = []
count_num = train_X.select_dtypes(['int64','float64']).nunique()
for i in count_num.index:
    if count_num[i] <= 15:
        low_num_card.append(i)
low_num_card
Out[121]:
['MSSubClass',
 'OverallQual',
 'OverallCond',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageCars',
 'PoolArea',
 'MoSold',
 'YrSold']

There are 15 numerical features with low cardinality. Then, after double check with data documentation, here is the result:

  • MSSubClass --- > Nominal
  • OverallQual --- > Ordinal
  • OverallCond --- > Ordinal
  • BsmtFullBath --- > Ordinal
  • BsmtHalfBath --- > Ordinal
  • FullBath --- > Ordinal
  • HalfBath --- > Ordinal
  • BedroomAbvGr --- > Ordinal
  • KitchenAbvGr --- > Ordinal
  • TotRmsAbvGrd --- > Ordinal
  • FirePlaces --- > Ordinal
  • GarageCars --- > Ordinal
  • PoolArea --- > Numeric
  • MoSold --- > Numeric (DateTime)
  • YrSold --- > Numeric (DateTime)

So, 12 out of 15 low cardinality numeric features can be treated as Ordinal/nominal features. Hence, i will take them out from numeric features group later.

In [122]:
ames_cat_col = ames_cat_col + low_num_card[:11]
ames_cat_col
Out[122]:
['MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'SaleType',
 'SaleCondition',
 'MSSubClass',
 'OverallQual',
 'OverallCond',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces']
In [123]:
# From categorical / ordinal group of features, i need to decide which one should be encoded with Ordinal or One-Hot technique
# based on it's nature
# Usually i need to read the data documentation to understand what each features represents
# But, there are too many features.  So, let's take the most efficient strategy
# Basically, almost all features with string value would be considered to be encoded with One-Hot encoding,
# Exception for some, features like 'ExterQual','ExterCond', 'BsmtCond', 'BsmtQual','BsmtExposure','BsmtFinType1',
# 'BsmtFinType2','HeatingQC','KitchenQual','GarageQual', and 'GarageCond' would be encoded with One-Hot encoding
# The reason why i encode some features above with Ordinal method is because they inherently represents quality level,
# and presummably, those leveling would better represented as ordinal data.

# Ordinal Encoding
ame_cat_to_ordinal = ['ExterQual','ExterCond', 'BsmtCond', 'BsmtQual','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC',
                      'KitchenQual','GarageQual','GarageCond']+ames_cat_col[38:]
# create a copy for this column in different dataframe
label_ordinal_train_X = train_X[ame_cat_to_ordinal].copy(deep=True)
label_ordinal_val_X = val_X[ame_cat_to_ordinal].copy(deep=True)
In [124]:
# for test dataset
imputed_ames_test_new.rename(columns={'index':'Id','imputed_BsmtCond':'BsmtCond'},inplace=True)
In [125]:
label_ordinal_test = imputed_ames_test_new[ame_cat_to_ordinal].copy(deep=True)
label_ordinal_test.nunique()
Out[125]:
ExterQual        4
ExterCond        5
BsmtCond         5
BsmtQual         5
BsmtExposure     5
BsmtFinType1     7
BsmtFinType2     7
HeatingQC        5
KitchenQual      4
GarageQual       4
GarageCond       5
MSSubClass      16
OverallQual     10
OverallCond      9
BsmtFullBath     5
BsmtHalfBath     3
FullBath         5
HalfBath         3
BedroomAbvGr     7
KitchenAbvGr     3
TotRmsAbvGrd    12
Fireplaces       5
dtype: int64
In [126]:
label_ordinal_val_X.nunique()
Out[126]:
ExterQual        4
ExterCond        3
BsmtCond         4
BsmtQual         5
BsmtExposure     5
BsmtFinType1     7
BsmtFinType2     7
HeatingQC        4
KitchenQual      4
GarageQual       4
GarageCond       5
MSSubClass      14
OverallQual      8
OverallCond      7
BsmtFullBath     3
BsmtHalfBath     3
FullBath         4
HalfBath         3
BedroomAbvGr     7
KitchenAbvGr     2
TotRmsAbvGrd     9
Fireplaces       4
dtype: int64
In [127]:
# import OrdinalEncoder module
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value',unknown_value=15)
label_ordinal_train_X[:] = ordinal_encoder.fit_transform(label_ordinal_train_X)
label_ordinal_val_X[:] = ordinal_encoder.transform(label_ordinal_val_X)
In [128]:
# For test dataset
label_ordinal_test[:] = ordinal_encoder.transform(label_ordinal_test)
In [129]:
label_ordinal_train_X
label_ordinal_val_X
label_ordinal_test
Out[129]:
ExterQual ExterCond BsmtCond BsmtQual BsmtExposure BsmtFinType1 BsmtFinType2 HeatingQC KitchenQual GarageQual ... OverallQual OverallCond BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
0 3.0 3.0 4.0 4.0 3.0 5.0 3.0 4.0 3.0 4.0 ... 3.0 4.0 0.0 0.0 1.0 0.0 2.0 0.0 2.0 0.0
1 3.0 3.0 4.0 4.0 3.0 0.0 6.0 4.0 2.0 4.0 ... 4.0 4.0 0.0 0.0 1.0 1.0 3.0 0.0 3.0 0.0
2 3.0 3.0 4.0 2.0 3.0 2.0 6.0 2.0 3.0 4.0 ... 3.0 3.0 0.0 0.0 2.0 1.0 3.0 0.0 3.0 1.0
3 3.0 3.0 4.0 4.0 3.0 2.0 6.0 0.0 2.0 4.0 ... 4.0 4.0 0.0 0.0 2.0 1.0 3.0 0.0 4.0 1.0
4 2.0 3.0 4.0 2.0 3.0 0.0 6.0 0.0 2.0 4.0 ... 6.0 3.0 0.0 0.0 2.0 0.0 2.0 0.0 2.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 3.0 3.0 4.0 4.0 3.0 6.0 6.0 2.0 3.0 4.0 ... 2.0 5.0 0.0 0.0 1.0 1.0 3.0 0.0 2.0 0.0
1455 3.0 3.0 4.0 4.0 3.0 5.0 6.0 4.0 3.0 4.0 ... 2.0 3.0 0.0 0.0 1.0 1.0 3.0 0.0 3.0 0.0
1456 3.0 3.0 4.0 4.0 3.0 0.0 6.0 0.0 3.0 4.0 ... 3.0 5.0 1.0 0.0 1.0 0.0 4.0 0.0 4.0 1.0
1457 3.0 3.0 4.0 2.0 0.0 2.0 6.0 4.0 3.0 4.0 ... 3.0 3.0 0.0 1.0 1.0 0.0 3.0 0.0 3.0 0.0
1458 3.0 3.0 4.0 2.0 0.0 3.0 6.0 0.0 3.0 4.0 ... 5.0 3.0 0.0 0.0 2.0 1.0 3.0 0.0 6.0 1.0

1459 rows × 22 columns

In [130]:
ame_cat_to_onehot = [col for col in ames_cat_col if col not in ame_cat_to_ordinal]
label_onehot_train_X = train_X[ame_cat_to_onehot].copy(deep=True)
label_onehot_val_X = val_X[ame_cat_to_onehot].copy(deep=True)
In [131]:
# For test dataset
label_onehot_test = imputed_ames_test_new[ame_cat_to_onehot].copy(deep=True)
label_onehot_test
Out[131]:
MSZoning Street LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 ... Foundation Heating CentralAir Electrical Functional GarageType GarageFinish PavedDrive SaleType SaleCondition
0 RH Pave Reg Lvl AllPub Inside Gtl NAmes Feedr Norm ... CBlock GasA Y SBrkr Typ Attchd Unf Y WD Normal
1 RL Pave IR1 Lvl AllPub Corner Gtl NAmes Norm Norm ... CBlock GasA Y SBrkr Typ Attchd Unf Y WD Normal
2 RL Pave IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm ... PConc GasA Y SBrkr Typ Attchd Fin Y WD Normal
3 RL Pave IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm ... PConc GasA Y SBrkr Typ Attchd Fin Y WD Normal
4 RL Pave IR1 HLS AllPub Inside Gtl StoneBr Norm Norm ... PConc GasA Y SBrkr Typ Attchd RFn Y WD Normal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 RM Pave Reg Lvl AllPub Inside Gtl MeadowV Norm Norm ... CBlock GasA Y SBrkr Typ Attchd Unf Y WD Normal
1455 RM Pave Reg Lvl AllPub Inside Gtl MeadowV Norm Norm ... CBlock GasA Y SBrkr Typ CarPort Unf Y WD Abnorml
1456 RL Pave Reg Lvl AllPub Inside Gtl Mitchel Norm Norm ... CBlock GasA Y SBrkr Typ Detchd Unf Y WD Abnorml
1457 RL Pave Reg Lvl AllPub Inside Gtl Mitchel Norm Norm ... PConc GasA Y SBrkr Typ Attchd Unf Y WD Normal
1458 RL Pave Reg Lvl AllPub Inside Mod Mitchel Norm Norm ... PConc GasA Y SBrkr Typ Attchd Fin Y WD Normal

1459 rows × 27 columns

In [132]:
label_onehot_train_X
Out[132]:
MSZoning Street LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 ... Foundation Heating CentralAir Electrical Functional GarageType GarageFinish PavedDrive SaleType SaleCondition
618 RL Pave Reg Lvl AllPub Inside Gtl NridgHt Norm Norm ... PConc GasA Y SBrkr Typ Attchd Unf Y New Partial
870 RL Pave Reg Lvl AllPub Inside Gtl NAmes PosN Norm ... CBlock GasA N SBrkr Typ Detchd Unf Y WD Normal
92 RL Pave IR1 HLS AllPub Inside Gtl Crawfor Norm Norm ... BrkTil GasA Y SBrkr Typ Detchd Unf Y WD Normal
817 RL Pave IR1 Lvl AllPub CulDSac Gtl Mitchel Norm Norm ... PConc GasA Y SBrkr Typ Attchd RFn Y WD Normal
302 RL Pave IR1 Lvl AllPub Corner Gtl CollgCr Norm Norm ... PConc GasA Y SBrkr Typ Attchd RFn Y WD Normal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
763 RL Pave Reg Lvl AllPub Inside Gtl NoRidge Norm Norm ... PConc GasA Y SBrkr Typ Attchd RFn Y WD Normal
835 RL Pave Reg Lvl AllPub Inside Gtl Sawyer Norm Norm ... CBlock GasA Y SBrkr Min2 Attchd Unf Y WD Normal
1216 RM Pave Reg Lvl AllPub Inside Gtl Sawyer RRAe Norm ... Slab GasA Y SBrkr Typ Attchd Unf Y WD Normal
559 RL Pave Reg Lvl AllPub Inside Gtl Blmngtn Norm Norm ... PConc GasA Y SBrkr Typ Attchd Fin Y WD Normal
684 RL Pave IR2 Lvl AllPub CulDSac Gtl NoRidge Norm Norm ... PConc GasA Y SBrkr Typ Attchd Fin Y WD Normal

1110 rows × 27 columns

In [133]:
# One Hot Encoder
from sklearn.preprocessing import OneHotEncoder
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(label_onehot_train_X[:]))
OH_cols_val = pd.DataFrame(OH_encoder.transform(label_onehot_val_X[:]))
In [134]:
# For test dataset
OH_cols_test = pd.DataFrame(OH_encoder.transform(label_onehot_test[:]))
OH_cols_test.index = label_onehot_test.index
OH_cols_test
Out[134]:
0 1 2 3 4 5 6 7 8 9 ... 165 166 167 168 169 170 171 172 173 174
0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
4 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1455 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
1456 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
1457 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1458 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

1459 rows × 175 columns

In [135]:
OH_cols_train.index = label_onehot_train_X.index
OH_cols_val.index = label_onehot_val_X.index
In [136]:
OH_encoder.inverse_transform(OH_cols_train)
OH_col_names = OH_encoder.get_feature_names().tolist()
C:\Users\Galih Dwika Putra R\AppData\Roaming\Python\Python39\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
In [137]:
# Change the columns name
OH_cols_train.columns = OH_col_names
OH_cols_val.columns = OH_col_names
In [138]:
OH_cols_test.columns = OH_col_names
OH_cols_test
Out[138]:
x0_C (all) x0_FV x0_RH x0_RL x0_RM x1_Grvl x1_Pave x2_IR1 x2_IR2 x2_IR3 ... x25_ConLI x25_ConLw x25_New x25_WD x26_Abnorml x26_AdjLand x26_Alloca x26_Family x26_Normal x26_Partial
0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
4 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1455 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
1456 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
1457 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1458 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

1459 rows × 175 columns

In [139]:
# join two data into one dataframe filled with transformed categorical features
ames_transform_all_cat_train = pd.concat([label_ordinal_train_X,OH_cols_train],axis=1)
ames_transform_all_cat_val = pd.concat([label_ordinal_val_X,OH_cols_val],axis=1)
In [140]:
# For test dataset
ames_transform_all_cat_test = pd.concat([label_ordinal_test,OH_cols_test],axis=1)
ames_transform_all_cat_test
Out[140]:
ExterQual ExterCond BsmtCond BsmtQual BsmtExposure BsmtFinType1 BsmtFinType2 HeatingQC KitchenQual GarageQual ... x25_ConLI x25_ConLw x25_New x25_WD x26_Abnorml x26_AdjLand x26_Alloca x26_Family x26_Normal x26_Partial
0 3.0 3.0 4.0 4.0 3.0 5.0 3.0 4.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1 3.0 3.0 4.0 4.0 3.0 0.0 6.0 4.0 2.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 3.0 3.0 4.0 2.0 3.0 2.0 6.0 2.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 3.0 3.0 4.0 4.0 3.0 2.0 6.0 0.0 2.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
4 2.0 3.0 4.0 2.0 3.0 0.0 6.0 0.0 2.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 3.0 3.0 4.0 4.0 3.0 6.0 6.0 2.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1455 3.0 3.0 4.0 4.0 3.0 5.0 6.0 4.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
1456 3.0 3.0 4.0 4.0 3.0 0.0 6.0 0.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
1457 3.0 3.0 4.0 2.0 0.0 2.0 6.0 4.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1458 3.0 3.0 4.0 2.0 0.0 3.0 6.0 0.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

1459 rows × 197 columns

In [141]:
ames_transform_all_cat_train.head()
Out[141]:
ExterQual ExterCond BsmtCond BsmtQual BsmtExposure BsmtFinType1 BsmtFinType2 HeatingQC KitchenQual GarageQual ... x25_ConLI x25_ConLw x25_New x25_WD x26_Abnorml x26_AdjLand x26_Alloca x26_Family x26_Normal x26_Partial
618 0.0 3.0 4.0 0.0 0.0 2.0 6.0 0.0 2.0 4.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
870 3.0 3.0 4.0 4.0 3.0 6.0 6.0 2.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
92 3.0 2.0 4.0 2.0 3.0 0.0 6.0 0.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
817 2.0 3.0 4.0 2.0 3.0 2.0 6.0 0.0 2.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
302 2.0 3.0 4.0 2.0 3.0 6.0 6.0 0.0 2.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

5 rows × 197 columns

In [142]:
ames_transform_all_cat_val.head()
Out[142]:
ExterQual ExterCond BsmtCond BsmtQual BsmtExposure BsmtFinType1 BsmtFinType2 HeatingQC KitchenQual GarageQual ... x25_ConLI x25_ConLw x25_New x25_WD x26_Abnorml x26_AdjLand x26_Alloca x26_Family x26_Normal x26_Partial
529 2.0 3.0 4.0 4.0 3.0 5.0 6.0 4.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
491 3.0 3.0 4.0 4.0 3.0 1.0 5.0 4.0 1.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
459 3.0 3.0 4.0 4.0 3.0 3.0 6.0 4.0 2.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
279 3.0 3.0 4.0 2.0 3.0 1.0 6.0 0.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
655 3.0 3.0 4.0 4.0 3.0 6.0 6.0 4.0 3.0 4.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0

5 rows × 197 columns

In [143]:
# don't forget to drop all categorical/ordinal features from the original dataset
num_train_X = train_X.drop(ames_cat_col, axis = 1)
num_val_X = val_X.drop(ames_cat_col, axis = 1)
In [144]:
# for test dataset
num_test = imputed_ames_test_new.drop(ames_cat_col,axis=1)
num_test
Out[144]:
Id LotFrontage LotArea YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF ... GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
0 0 80.0 11622.0 1961.0 1961.0 0.0 468.0 144.0 270.0 882.0 ... 730.0 140.0 0.0 0.0 0.0 120.0 0.0 0.0 6.0 2010.0
1 1 81.0 14267.0 1958.0 1958.0 108.0 923.0 0.0 406.0 1329.0 ... 312.0 393.0 36.0 0.0 0.0 0.0 0.0 12500.0 6.0 2010.0
2 2 74.0 13830.0 1997.0 1998.0 0.0 791.0 0.0 137.0 928.0 ... 482.0 212.0 34.0 0.0 0.0 0.0 0.0 0.0 3.0 2010.0
3 3 78.0 9978.0 1998.0 1998.0 20.0 602.0 0.0 324.0 926.0 ... 470.0 360.0 36.0 0.0 0.0 0.0 0.0 0.0 6.0 2010.0
4 4 43.0 5005.0 1992.0 1992.0 0.0 263.0 0.0 1017.0 1280.0 ... 506.0 0.0 82.0 0.0 0.0 144.0 0.0 0.0 1.0 2010.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 1454 21.0 1936.0 1970.0 1970.0 0.0 0.0 0.0 546.0 546.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 2006.0
1455 1455 21.0 1894.0 1970.0 1970.0 0.0 252.0 0.0 294.0 546.0 ... 286.0 0.0 24.0 0.0 0.0 0.0 0.0 0.0 4.0 2006.0
1456 1456 160.0 20000.0 1960.0 1996.0 0.0 1224.0 0.0 0.0 1224.0 ... 576.0 474.0 0.0 0.0 0.0 0.0 0.0 0.0 9.0 2006.0
1457 1457 62.0 10441.0 1992.0 1992.0 0.0 337.0 0.0 575.0 912.0 ... 0.0 80.0 32.0 0.0 0.0 0.0 0.0 700.0 7.0 2006.0
1458 1458 74.0 9627.0 1993.0 1994.0 94.0 758.0 0.0 238.0 996.0 ... 650.0 190.0 48.0 0.0 0.0 0.0 0.0 0.0 11.0 2006.0

1459 rows × 26 columns

In [145]:
num_train_X.head()
Out[145]:
Id LotFrontage LotArea YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF ... GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
618 619 90.0 11694 2007 2007 452.0 48 0 1774 1822 ... 774 0 108 0 0 260 0 0 7 2007
870 871 60.0 6600 1962 1962 0.0 0 0 894 894 ... 308 0 0 0 0 0 0 0 8 2009
92 93 80.0 13360 1921 2006 0.0 713 0 163 876 ... 432 0 0 44 0 0 0 0 8 2009
817 818 75.0 13265 2002 2002 148.0 1218 0 350 1568 ... 857 150 59 0 0 0 0 0 7 2008
302 303 118.0 13704 2001 2002 150.0 0 0 1541 1541 ... 843 468 81 0 0 0 0 0 1 2006

5 rows × 26 columns

In [146]:
# Now merge the only-number dataframe with transformed categorical dataframe
all_num_cat_trans_train_X = pd.concat([num_train_X,ames_transform_all_cat_train],axis=1)
all_num_cat_trans_val_X = pd.concat([num_val_X,ames_transform_all_cat_val],axis=1)
In [147]:
# For test dataset
all_num_cat_trans_test = pd.concat([num_test,ames_transform_all_cat_test],axis=1)
all_num_cat_trans_test.shape
Out[147]:
(1459, 223)
In [148]:
print(all_num_cat_trans_train_X.shape)
print(all_num_cat_trans_val_X.shape)
(1110, 223)
(269, 223)

Standarization, and Normalization¶

We are done with categorical features. The features increase drastically, and perhaps would impact the model performance significantly, for better or worse. one thing for sure is all data features are now of numeric type. They all have different range and distribution; My next job is to examine and make sure this difference will not reduce model's performance.

In this part, i will do normality test using Shapiro-Wilk against only the continuous or 'original number features' so i called. Then, features that do not follow Gaussian distribution will go through a normalization procedure using Robust Scaler.

Side note: Shapiro-Wilk Test's hypothesis are: H0 : The population is distributed normally H1 : The population is not distributed normally

In [149]:
# first of all,  i will make 'Id' as dataframe index
all_num_cat_trans_train_X = all_num_cat_trans_train_X.set_index('Id')
all_num_cat_trans_val_X = all_num_cat_trans_val_X.set_index('Id')
In [150]:
# For test dataset
all_num_cat_trans_test = all_num_cat_trans_test.set_index('Id')
In [151]:
# Next, i will exclude dummy and ordinal features from strandarization process
num_features = [col for col in all_num_cat_trans_train_X.columns if col in num_train_X]
len(num_features)
Out[151]:
25
In [152]:
# import shapiro module
from scipy.stats import shapiro

# Perform the test
def ShapiroTest(df, sig):
    p_val = []
    decs = []
    cols = []
    for i in df.columns:
        shap,res = shapiro(df[i])
        cols.append(i)
        p_val.append(res)
        if res > sig:
            decs.append("Normal Distribution")
        else:
            decs.append("Not Normal Distributrion")
    res_df = pd.DataFrame({'Feature': cols, 'P-Value':p_val, 'Conclusion': decs})
    return res_df
In [153]:
Shapiro_test_train = ShapiroTest(all_num_cat_trans_train_X[num_features],0.05)
In [154]:
Shapiro_test_train
Out[154]:
Feature P-Value Conclusion
0 LotFrontage 3.162906e-26 Not Normal Distributrion
1 LotArea 0.000000e+00 Not Normal Distributrion
2 YearBuilt 1.070809e-23 Not Normal Distributrion
3 YearRemodAdd 3.671455e-30 Not Normal Distributrion
4 MasVnrArea 5.507103e-43 Not Normal Distributrion
5 BsmtFinSF1 2.696543e-27 Not Normal Distributrion
6 BsmtFinSF2 0.000000e+00 Not Normal Distributrion
7 BsmtUnfSF 4.893147e-22 Not Normal Distributrion
8 TotalBsmtSF 2.406882e-17 Not Normal Distributrion
9 1stFlrSF 3.416720e-19 Not Normal Distributrion
10 2ndFlrSF 1.329707e-36 Not Normal Distributrion
11 LowQualFinSF 0.000000e+00 Not Normal Distributrion
12 GrLivArea 4.646039e-20 Not Normal Distributrion
13 GarageYrBlt 1.433124e-23 Not Normal Distributrion
14 GarageCars 2.738633e-36 Not Normal Distributrion
15 GarageArea 1.079923e-16 Not Normal Distributrion
16 WoodDeckSF 1.312211e-36 Not Normal Distributrion
17 OpenPorchSF 1.260988e-38 Not Normal Distributrion
18 EnclosedPorch 0.000000e+00 Not Normal Distributrion
19 3SsnPorch 0.000000e+00 Not Normal Distributrion
20 ScreenPorch 0.000000e+00 Not Normal Distributrion
21 PoolArea 0.000000e+00 Not Normal Distributrion
22 MiscVal 0.000000e+00 Not Normal Distributrion
23 MoSold 3.222904e-15 Not Normal Distributrion
24 YrSold 1.196740e-26 Not Normal Distributrion
In [155]:
Shapiro_test_train['Conclusion'].value_counts() # All features do not follow Gaussian Distribution
Out[155]:
Not Normal Distributrion    25
Name: Conclusion, dtype: int64
In [156]:
Shapiro_test_val = ShapiroTest(all_num_cat_trans_val_X[num_features],0.05)
Shapiro_test_val 
Out[156]:
Feature P-Value Conclusion
0 LotFrontage 3.379728e-18 Not Normal Distributrion
1 LotArea 3.853450e-23 Not Normal Distributrion
2 YearBuilt 7.989654e-11 Not Normal Distributrion
3 YearRemodAdd 5.193416e-15 Not Normal Distributrion
4 MasVnrArea 2.387008e-22 Not Normal Distributrion
5 BsmtFinSF1 2.171661e-20 Not Normal Distributrion
6 BsmtFinSF2 1.714818e-30 Not Normal Distributrion
7 BsmtUnfSF 2.323560e-11 Not Normal Distributrion
8 TotalBsmtSF 1.464569e-17 Not Normal Distributrion
9 1stFlrSF 3.761116e-15 Not Normal Distributrion
10 2ndFlrSF 1.013616e-19 Not Normal Distributrion
11 LowQualFinSF 1.358333e-34 Not Normal Distributrion
12 GrLivArea 1.549194e-15 Not Normal Distributrion
13 GarageYrBlt 4.751121e-11 Not Normal Distributrion
14 GarageCars 3.655624e-18 Not Normal Distributrion
15 GarageArea 8.067395e-09 Not Normal Distributrion
16 WoodDeckSF 2.535409e-18 Not Normal Distributrion
17 OpenPorchSF 4.363552e-19 Not Normal Distributrion
18 EnclosedPorch 6.642034e-30 Not Normal Distributrion
19 3SsnPorch 1.272099e-33 Not Normal Distributrion
20 ScreenPorch 2.572267e-30 Not Normal Distributrion
21 PoolArea 5.303774e-34 Not Normal Distributrion
22 MiscVal 5.866475e-34 Not Normal Distributrion
23 MoSold 3.800067e-05 Not Normal Distributrion
24 YrSold 5.578459e-13 Not Normal Distributrion
In [157]:
# Now, we know that all numerical features, either from train or validation dataset are not comes from normal distribution
# So i will perform normalization for all of them. But before that, 'MoSold','YrSold', and 'GarageYrBlt' should be separate
# because they are datetime features, and standarization in any form will make no sense.
year_feature = ['YearBuilt','YearRemodAdd','GarageYrBlt','MoSold','YrSold']
num_features = [col for col in num_features if col not in year_feature]
len(num_features)
Out[157]:
20
In [158]:
# Distribution Visual
figDis,axDis = plt.subplots(5,4,figsize=(15,15))
sns.histplot(ax=axDis[0,0],x='LotFrontage',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[0,1],x='LotArea',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[0,2],x='MasVnrArea',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[0,3],x='BsmtFinSF1',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[1,0],x='BsmtFinSF2',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[1,1],x='BsmtUnfSF',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[1,2],x='TotalBsmtSF',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[1,3],x='1stFlrSF',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[2,0],x='2ndFlrSF',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[2,1],x='GrLivArea',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[2,2],x='GarageCars',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[2,3],x='GarageArea',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[3,0],x='WoodDeckSF',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[3,1],x='OpenPorchSF',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[3,2],x='EnclosedPorch',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[3,3],x='3SsnPorch',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[4,0],x='ScreenPorch',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[4,1],x='PoolArea',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[4,2],x='MiscVal',data=all_num_cat_trans_train_X)
sns.histplot(ax=axDis[4,3],x='LowQualFinSF',data=all_num_cat_trans_train_X)
plt.subplots_adjust(wspace=0.4)
plt.show()
In [159]:
# Scaling predictor and target features
from sklearn.preprocessing import RobustScaler
num_feature = num_features[:20]
num_feature
robustScaler = RobustScaler()
scaled_num_robust_train_X = pd.DataFrame(robustScaler.fit_transform(all_num_cat_trans_train_X[num_feature]),
                                       columns=num_feature)
scaled_num_robust_val_X = pd.DataFrame(robustScaler.transform(all_num_cat_trans_val_X[num_feature]),
                                      columns=num_feature)
In [160]:
# For Test dataset
num_feature = num_features[:20]
scaled_num_robust_test = pd.DataFrame(robustScaler.transform(all_num_cat_trans_test[num_feature]),
                                     columns=num_feature)
scaled_num_robust_test
Out[160]:
LotFrontage LotArea MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal
0 0.35 0.513038 0.000000 0.094816 144.0 -0.361218 -0.260600 -0.406386 0.000000 0.0 -0.931292 -1.0 1.219331 0.828402 -0.405797 0.0 0.0 120.0 0.0 0.0
1 0.40 1.182658 0.622478 0.715553 0.0 -0.127842 0.663909 0.431543 0.000000 0.0 -0.251276 -1.0 -0.852540 2.325444 0.115942 0.0 0.0 0.0 0.0 12500.0
2 0.05 1.072025 0.000000 0.535471 0.0 -0.589447 -0.165460 -0.344461 0.948579 0.0 0.219867 0.0 -0.009913 1.254438 0.086957 0.0 0.0 0.0 0.0 0.0
3 0.25 0.096835 0.115274 0.277626 0.0 -0.268554 -0.169597 -0.348331 0.917456 0.0 0.180605 0.0 -0.069393 2.130178 0.115942 0.0 0.0 0.0 0.0 0.0
4 -1.50 -1.162152 0.000000 -0.184857 0.0 0.920635 0.562565 0.336720 0.000000 0.0 -0.328229 0.0 0.109046 0.000000 0.782609 0.0 0.0 144.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 -2.60 -1.939114 0.000000 -0.543656 0.0 0.112398 -0.955533 -1.083696 0.738836 0.0 -0.623479 -2.0 -2.399009 0.000000 -0.405797 0.0 0.0 0.0 0.0 0.0
1455 -2.60 -1.949747 0.000000 -0.199864 0.0 -0.320034 -0.955533 -1.083696 0.738836 0.0 -0.623479 -1.0 -0.981413 0.000000 -0.057971 0.0 0.0 0.0 0.0 0.0
1456 4.35 2.634051 0.000000 1.126194 0.0 -0.824539 0.446743 0.228350 0.000000 0.0 -0.416176 0.0 0.456010 2.804734 -0.405797 0.0 0.0 0.0 0.0 0.0
1457 -0.55 0.214051 0.000000 -0.083902 0.0 0.162162 -0.198552 -0.263183 0.000000 0.0 -0.815077 -2.0 -2.399009 0.473373 0.057971 0.0 0.0 0.0 0.0 700.0
1458 0.05 0.007975 0.541787 0.490450 0.0 -0.416130 -0.024819 -0.212869 1.358593 0.0 0.802513 1.0 0.822800 1.124260 0.289855 0.0 0.0 0.0 0.0 0.0

1459 rows × 20 columns

In [161]:
all_num_cat_trans_train_X.shape
cat_features = [col for col in all_num_cat_trans_train_X.columns if col not in num_features]
len(cat_features) # 202 cat features
len(scaled_num_robust_train_X.columns) # 20 features
Out[161]:
20
In [162]:
scaled_num_robust_train_X.index = all_num_cat_trans_train_X.index
scaled_num_robust_val_X.index = all_num_cat_trans_val_X.index
In [163]:
# drop all number features from original dataset in train and val

all_cat_trans_train_X = all_num_cat_trans_train_X[cat_features]
all_cat_trans_val_X = all_num_cat_trans_val_X[cat_features]
len(all_cat_trans_train_X.columns)
Out[163]:
202
In [164]:
# For test dataset
all_cat_trans_test = all_num_cat_trans_test[cat_features]
len(all_cat_trans_test.columns)
Out[164]:
202
In [165]:
scaled_all_num_cat_robust_trainX = pd.concat([scaled_num_robust_train_X,all_cat_trans_train_X],axis=1)
scaled_all_num_cat_robust_valX = pd.concat([scaled_num_robust_val_X,all_cat_trans_val_X],axis=1)
scaled_all_num_cat_robust_trainX
Out[165]:
LotFrontage LotArea MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF ... x25_ConLI x25_ConLw x25_New x25_WD x26_Abnorml x26_AdjLand x26_Alloca x26_Family x26_Normal x26_Partial
Id
619 0.85 0.531266 2.605187 -0.478172 0.0 2.219648 1.683557 1.397194 0.000000 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
871 -0.65 -0.758354 0.000000 -0.543656 0.0 0.709567 -0.235781 -0.410256 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
93 0.35 0.953038 0.000000 0.429059 0.0 -0.544831 -0.273009 -0.274794 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
818 0.10 0.928987 0.853026 1.118008 0.0 -0.223938 1.158221 1.128205 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
303 2.25 1.040127 0.864553 -0.543656 0.0 1.819820 1.102378 0.841800 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
764 0.45 -0.041899 3.878963 1.042974 0.0 -0.671815 0.504654 0.313498 1.484438 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
836 -0.65 0.001139 0.000000 0.059345 0.0 0.247962 0.122027 -0.075472 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1217 -0.25 -0.168481 0.000000 -0.543656 0.0 -0.824539 -2.084798 0.410256 0.790257 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
560 -1.45 -1.620127 0.103746 -0.543656 0.0 1.533248 0.756980 0.872762 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
685 -0.75 1.816329 0.172911 -0.543656 0.0 1.226083 0.386763 0.172230 0.871448 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

1110 rows × 222 columns

In [166]:
# Concate for test dataset
scaled_all_num_cat_robust_test = pd.concat([scaled_num_robust_test,all_cat_trans_test],axis=1)
scaled_all_num_cat_robust_test
Out[166]:
LotFrontage LotArea MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF ... x25_ConLI x25_ConLw x25_New x25_WD x26_Abnorml x26_AdjLand x26_Alloca x26_Family x26_Normal x26_Partial
0 0.35 0.513038 0.000000 0.094816 144.0 -0.361218 -0.260600 -0.406386 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.40 1.182658 0.622478 0.715553 0.0 -0.127842 0.663909 0.431543 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0.05 1.072025 0.000000 0.535471 0.0 -0.589447 -0.165460 -0.344461 0.948579 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 0.25 0.096835 0.115274 0.277626 0.0 -0.268554 -0.169597 -0.348331 0.917456 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
4 -1.50 -1.162152 0.000000 -0.184857 0.0 0.920635 0.562565 0.336720 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 -2.60 -1.939114 0.000000 -0.543656 0.0 0.112398 -0.955533 -1.083696 0.738836 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1455 -2.60 -1.949747 0.000000 -0.199864 0.0 -0.320034 -0.955533 -1.083696 0.738836 0.0 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
1456 4.35 2.634051 0.000000 1.126194 0.0 -0.824539 0.446743 0.228350 0.000000 0.0 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
1457 -0.55 0.214051 0.000000 -0.083902 0.0 0.162162 -0.198552 -0.263183 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1458 0.05 0.007975 0.541787 0.490450 0.0 -0.416130 -0.024819 -0.212869 1.358593 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

1459 rows × 222 columns

In [167]:
scaled_all_num_cat_robust_trainX
Out[167]:
LotFrontage LotArea MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF ... x25_ConLI x25_ConLw x25_New x25_WD x26_Abnorml x26_AdjLand x26_Alloca x26_Family x26_Normal x26_Partial
Id
619 0.85 0.531266 2.605187 -0.478172 0.0 2.219648 1.683557 1.397194 0.000000 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
871 -0.65 -0.758354 0.000000 -0.543656 0.0 0.709567 -0.235781 -0.410256 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
93 0.35 0.953038 0.000000 0.429059 0.0 -0.544831 -0.273009 -0.274794 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
818 0.10 0.928987 0.853026 1.118008 0.0 -0.223938 1.158221 1.128205 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
303 2.25 1.040127 0.864553 -0.543656 0.0 1.819820 1.102378 0.841800 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
764 0.45 -0.041899 3.878963 1.042974 0.0 -0.671815 0.504654 0.313498 1.484438 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
836 -0.65 0.001139 0.000000 0.059345 0.0 0.247962 0.122027 -0.075472 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1217 -0.25 -0.168481 0.000000 -0.543656 0.0 -0.824539 -2.084798 0.410256 0.790257 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
560 -1.45 -1.620127 0.103746 -0.543656 0.0 1.533248 0.756980 0.872762 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
685 -0.75 1.816329 0.172911 -0.543656 0.0 1.226083 0.386763 0.172230 0.871448 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

1110 rows × 222 columns

In [168]:
scaled_all_num_cat_robust_valX
Out[168]:
LotFrontage LotArea MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF ... x25_ConLI x25_ConLw x25_New x25_WD x26_Abnorml x26_AdjLand x26_Alloca x26_Family x26_Normal x26_Partial
Id
530 -0.10 5.841139 0.000000 1.119372 0.0 0.575719 2.124095 2.726657 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
492 0.30 -0.026709 0.000000 0.006139 165.0 -0.416130 -0.417787 -0.286405 0.838972 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
460 0.30 -0.653291 0.927954 -0.291269 0.0 0.074646 -0.618407 -0.245767 0.303112 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
280 0.50 0.103671 1.723343 -0.008868 0.0 0.493350 0.314374 0.096759 1.171854 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
656 -2.60 -2.003924 2.195965 -0.543656 0.0 0.076362 -0.998966 -1.124335 0.767253 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1030 -2.60 -2.003924 1.619597 -0.111187 0.0 -0.215358 -0.694933 -0.839865 0.738836 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
327 -2.05 0.316582 0.000000 1.665075 0.0 -0.652939 1.470527 1.186260 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
441 1.60 1.477342 1.152738 1.866985 539.0 0.527671 4.314374 2.507983 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1388 -0.65 -0.272278 0.000000 -0.314461 546.0 -0.824539 -0.608066 1.079826 1.166441 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
62 -0.65 -0.606456 0.000000 -0.543656 0.0 0.084942 -0.988625 -1.015965 0.717185 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

269 rows × 222 columns

In [169]:
figDis,axDis = plt.subplots(5,4,figsize=(15,15))
sns.histplot(ax=axDis[0,0],x='LotFrontage',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[0,1],x='LotArea',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[0,2],x='MasVnrArea',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[0,3],x='BsmtFinSF1',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[1,0],x='BsmtFinSF2',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[1,1],x='BsmtUnfSF',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[1,2],x='TotalBsmtSF',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[1,3],x='1stFlrSF',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[2,0],x='2ndFlrSF',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[2,1],x='GrLivArea',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[2,2],x='GarageCars',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[2,3],x='GarageArea',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[3,0],x='WoodDeckSF',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[3,1],x='OpenPorchSF',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[3,2],x='EnclosedPorch',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[3,3],x='3SsnPorch',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[4,0],x='ScreenPorch',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[4,1],x='PoolArea',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[4,2],x='MiscVal',data=scaled_all_num_cat_robust_trainX)
sns.histplot(ax=axDis[4,3],x='LowQualFinSF',data=scaled_all_num_cat_robust_trainX)
plt.subplots_adjust(wspace=0.4)
plt.show()
In [170]:
# Experiment to normalize data using 
from sklearn.preprocessing import PowerTransformer
powerTrans = PowerTransformer()
normalize_num_train_X = pd.DataFrame(powerTrans.fit_transform(all_num_cat_trans_train_X[num_features]),columns=num_features)
normalize_num_train_X.shape
Out[170]:
(1110, 20)
In [171]:
figNorm,axNorm = plt.subplots(5,4,figsize=(15,15))
sns.histplot(ax=axNorm[0,0],x='LotFrontage',data=normalize_num_train_X)
sns.histplot(ax=axNorm[0,1],x='LotArea',data=normalize_num_train_X)
sns.histplot(ax=axNorm[0,2],x='MasVnrArea',data=normalize_num_train_X)
sns.histplot(ax=axNorm[0,3],x='BsmtFinSF1',data=normalize_num_train_X)
sns.histplot(ax=axNorm[1,0],x='BsmtFinSF2',data=normalize_num_train_X)
sns.histplot(ax=axNorm[1,1],x='BsmtUnfSF',data=normalize_num_train_X)
sns.histplot(ax=axNorm[1,2],x='TotalBsmtSF',data=normalize_num_train_X)
sns.histplot(ax=axNorm[1,3],x='1stFlrSF',data=normalize_num_train_X)
sns.histplot(ax=axNorm[2,0],x='2ndFlrSF',data=normalize_num_train_X)
sns.histplot(ax=axNorm[2,1],x='GrLivArea',data=normalize_num_train_X)
sns.histplot(ax=axNorm[2,2],x='GarageCars',data=normalize_num_train_X)
sns.histplot(ax=axNorm[2,3],x='GarageArea',data=normalize_num_train_X)
sns.histplot(ax=axNorm[3,0],x='WoodDeckSF',data=normalize_num_train_X)
sns.histplot(ax=axNorm[3,1],x='OpenPorchSF',data=normalize_num_train_X)
sns.histplot(ax=axNorm[3,2],x='EnclosedPorch',data=normalize_num_train_X)
sns.histplot(ax=axNorm[3,3],x='3SsnPorch',data=normalize_num_train_X)
sns.histplot(ax=axNorm[4,0],x='ScreenPorch',data=normalize_num_train_X)
sns.histplot(ax=axNorm[4,1],x='PoolArea',data=normalize_num_train_X)
sns.histplot(ax=axNorm[4,2],x='MiscVal',data=normalize_num_train_X)
sns.histplot(ax=axDis[4,3],x='LowQualFinSF',data=normalize_num_train_X)
plt.subplots_adjust(wspace=0.4)
plt.show()
In [172]:
# We can see that almost all numeric features distribution move closer to Gaussian distribution.

ShapiroTest(normalize_num_train_X,0.05)
# even so, all of them still can't pass normality test.
# So, here i decide to proceed my analysis based only on scaled data.  
Out[172]:
Feature P-Value Conclusion
0 LotFrontage 2.680891e-20 Not Normal Distributrion
1 LotArea 1.316176e-26 Not Normal Distributrion
2 MasVnrArea 1.129587e-41 Not Normal Distributrion
3 BsmtFinSF1 4.902530e-34 Not Normal Distributrion
4 BsmtFinSF2 0.000000e+00 Not Normal Distributrion
5 BsmtUnfSF 2.855252e-11 Not Normal Distributrion
6 TotalBsmtSF 3.500056e-17 Not Normal Distributrion
7 1stFlrSF 9.795268e-03 Not Normal Distributrion
8 2ndFlrSF 1.726400e-42 Not Normal Distributrion
9 LowQualFinSF 0.000000e+00 Not Normal Distributrion
10 GrLivArea 8.524834e-03 Not Normal Distributrion
11 GarageCars 2.182170e-36 Not Normal Distributrion
12 GarageArea 8.502025e-07 Not Normal Distributrion
13 WoodDeckSF 3.241063e-40 Not Normal Distributrion
14 OpenPorchSF 2.072875e-35 Not Normal Distributrion
15 EnclosedPorch 0.000000e+00 Not Normal Distributrion
16 3SsnPorch 0.000000e+00 Not Normal Distributrion
17 ScreenPorch 0.000000e+00 Not Normal Distributrion
18 PoolArea 0.000000e+00 Not Normal Distributrion
19 MiscVal 0.000000e+00 Not Normal Distributrion
In [173]:
# Transform y-feature (salePrice)
new_train_y = train_y.values.reshape(-1,1)
new_val_y = val_y.values.reshape(-1,1)
robustScaler_forPrice = RobustScaler()
scaled_train_y = pd.DataFrame(robustScaler_forPrice.fit_transform(new_train_y),columns=['SalePrice'])
scaled_val_y = pd.DataFrame(robustScaler_forPrice.transform(new_val_y),columns=['SalePrice'])
scaled_train_y.index = train_y.index+1
scaled_val_y.index = val_y.index+1
In [174]:
scaled_train_y.head()
Out[174]:
SalePrice
619 1.786035
871 -0.629412
93 0.005882
818 1.270588
303 0.494118
In [175]:
scaled_val_y.head()
Out[175]:
SalePrice
530 0.442635
492 -0.352941
460 -0.623529
280 0.341176
656 -0.882353
In [176]:
scaled_all_num_cat_robust_trainX.index
Out[176]:
Int64Index([ 619,  871,   93,  818,  303, 1455,   41,  960,   76, 1390,
            ...
            1095,  600,  278, 1034, 1384,  764,  836, 1217,  560,  685],
           dtype='int64', name='Id', length=1110)
In [177]:
# Adjust train and validation Y index
scaled_train_y = scaled_train_y.loc[scaled_all_num_cat_robust_trainX.index]
scaled_train_y
Out[177]:
SalePrice
Id
619 1.786035
871 -0.629412
93 0.005882
818 1.270588
303 0.494118
... ...
764 2.047059
836 -0.411765
1217 -0.600000
560 0.835294
685 0.682353

1110 rows × 1 columns

In [178]:
scaled_all_num_cat_robust_valX.index
Out[178]:
Int64Index([ 530,  492,  460,  280,  656, 1014, 1404,  602, 1183,  688,
            ...
            1360, 1302, 1428, 1403, 1224, 1030,  327,  441, 1388,   62],
           dtype='int64', name='Id', length=269)
In [179]:
# Adjust train and validation Y index
scaled_val_y = scaled_val_y.loc[scaled_all_num_cat_robust_valX.index]
scaled_val_y
Out[179]:
SalePrice
Id
530 0.442635
492 -0.352941
460 -0.623529
280 0.341176
656 -0.882353
... ...
1030 -0.529412
327 1.894118
441 4.611765
1388 -0.317647
62 -0.729412

269 rows × 1 columns

In [180]:
# Outliers
# I wouldn't do any outlier detection at this point
# The reason are : i already scaling it, and any outlier detection after this stage seems to generate a condition where
# all rows considered outliers. To prove this, run the following code to compare the result 
# of outlier detection using scaled and original dataset


## NOTICE : It is not a general rule, but maybe it would be more meaningfull to detect and treat outlier before all 
## preprocessing steps. Each case and each method of transformation has different impact.
## uncomment one of these part 

###############################################
##### Outliers detection in scaled data set
#Q1 = scaled_all_num_cat_robust_trainX.quantile(0.25)
#Q3 = scaled_all_num_cat_robust_trainX.quantile(0.75)

# calculate the IQR
#QR = Q3 - Q1

# filter the dataset with the IQR
#IQR_outliers = scaled_all_num_cat_robust_trainX[((scaled_all_num_cat_robust_trainX < (Q1 - 1.5 * IQR)) |(scaled_all_num_cat_robust_trainX > (Q3 + 1.5 * IQR))).any(axis=1)]
#IQR_outliers


###############################################

##### Outliers detection in original data set
#Q1 = train_X.quantile(0.25)
#Q3 = train_X.quantile(0.75)

# calculate the IQR
#QR = Q3 - Q1

# filter the dataset with the IQR
#IQR_outliers = train_X[((train_X < (Q1 - 1.5 * IQR)) |(train_X > (Q3 + 1.5 * IQR))).any(axis=1)]
#IQR_outliers

Feature Selection and Engineering¶

Ames Housing dataset has 222 features to use for model building. While it is a bliss to have lot of data, it is also a curse. More data means higher cost of maintaining database and collecting data. It is also burden our computing machine, and make time spend to train our model longer

Not all features need to be included in model, and apparently, some important predictor can only be obtained by processing and manipulating existing features to generate a new one.

For this section, i will conduct three feature selection method, i.e:

  • Mutual Information
  • Manual significance test
  • Foward Sequential Feature Selection (FSFS)

For a detailed explanation of each method, please read more credible sources. Personally, I recommend articles from analyticsvidhya.com and kdnuggets.com

Before i perform all selection process, i will do some feature engineering instinctively. If all process generate result as expected, then i would proceed feature selection with all method i mentioned before.

At the end of this section, i will have 3 different subsets of data ready to trained with model

In [181]:
# Feature Engineering

# Year Feature Vs SalePrice
figYr,axYr = plt.subplots(2,2, figsize=(10,8))
sns.lineplot(ax=axYr[0,0],x=scaled_all_num_cat_robust_trainX['YearBuilt'], y = scaled_train_y['SalePrice'])
sns.lineplot(ax=axYr[0,1],x=scaled_all_num_cat_robust_trainX['YearRemodAdd'],y=scaled_train_y['SalePrice'])
sns.lineplot(ax=axYr[1,0],x=scaled_all_num_cat_robust_trainX['MoSold'], y = scaled_train_y['SalePrice'])
sns.lineplot(ax=axYr[1,1],x=scaled_all_num_cat_robust_trainX['YrSold'],y=scaled_train_y['SalePrice'])
plt.show()

SalePrice has increasing trend against YearBuilt and YearRemodAdd. There are other factors involved like price level that should be considered, but because of the limitation of data, i decide to use this price as my analysis basis. Future research may be conducted by adding inflation level feature so the real price effect can be examined.

A support basis for this decision comes from Selim, H (2009) that analyzing determinant of house prices in Turkey. Selim,H use similiar features as Ames, Iowa dataset to break down how housing price in Turkey determined by applied and compared predicted result from Hedonic Regression and Artificial Neural Network. Selim,H(2009) also use nominal price selling as target features, and transform it using natural logarithm.

As opposite, SalePrice seems to have decreasing trend against YrSold. This may have connection to the fact that the sales record only available from 2006 to 2010, a time period around Global Financial Crisis 2009 due to subprime mortgage. This short period of time(4 years of record) is rather unbalance compared to YearBuilt or YearRemodAdd, hence, i probably won't use it as house price determinant because of the earlier reasoning and significant event that occurs at that period.

Rather than use year feature as it is, i will use age of the house. House age has been use as determinant feature and proved to be statistically significant by some papers, such as Selim, H(2009), and Xu, Y., Zhang, Q., Zheng, S. et al(2018)

I will generate two features, that is age based on building year and age based on remodelling year, to see the difference and its impact.

In [182]:
# engineering AgeBuilt
scaled_all_num_cat_robust_trainX['AgeBuilt'] = scaled_all_num_cat_robust_trainX['YrSold'] - scaled_all_num_cat_robust_trainX['YearBuilt']
# engineering AgeRemodAdd
scaled_all_num_cat_robust_trainX['AgeRemodAdd'] = scaled_all_num_cat_robust_trainX['YrSold'] - scaled_all_num_cat_robust_trainX['YearRemodAdd']
In [183]:
figAge, axAge = plt.subplots(1,2,figsize=(10,5))
sns.lineplot(ax=axAge[0],x=scaled_all_num_cat_robust_trainX['AgeBuilt'],y=scaled_train_y['SalePrice'])
sns.lineplot(ax=axAge[1],x=scaled_all_num_cat_robust_trainX['AgeRemodAdd'],y=scaled_train_y['SalePrice'])
plt.show()
In [184]:
# Generate the same features in validation data
# engineering AgeBuilt
scaled_all_num_cat_robust_valX['AgeBuilt'] = scaled_all_num_cat_robust_valX['YrSold'] - scaled_all_num_cat_robust_valX['YearBuilt']
# engineering AgeRemodAdd
scaled_all_num_cat_robust_valX['AgeRemodAdd'] = scaled_all_num_cat_robust_valX['YrSold'] - scaled_all_num_cat_robust_valX['YearRemodAdd']
In [185]:
# for test dataset
scaled_all_num_cat_robust_test['AgeBuilt'] = scaled_all_num_cat_robust_test["YrSold"] - scaled_all_num_cat_robust_test['YearBuilt']
scaled_all_num_cat_robust_test['AgeRemodAdd'] = scaled_all_num_cat_robust_test['YrSold'] - scaled_all_num_cat_robust_test['YearBuilt']
In [186]:
figAge, axAge = plt.subplots(1,2,figsize=(10,5))
sns.lineplot(ax=axAge[0],x=scaled_all_num_cat_robust_valX['AgeBuilt'],y=scaled_val_y['SalePrice'])
sns.lineplot(ax=axAge[1],x=scaled_all_num_cat_robust_valX['AgeRemodAdd'],y=scaled_val_y['SalePrice'])
plt.show()
In [187]:
# Scaling Age Features
robustScaler_forAge = RobustScaler()
scaled_age_train_X = pd.DataFrame(robustScaler_forAge.fit_transform(scaled_all_num_cat_robust_trainX[['AgeBuilt','AgeRemodAdd']]),
                                      columns=['AgeBuilt','AgeRemodAdd'])
scaled_age_val_X = pd.DataFrame(robustScaler_forAge.transform(scaled_all_num_cat_robust_valX[['AgeBuilt','AgeRemodAdd']]),
                                      columns=['AgeBuilt','AgeRemodAdd'])
In [188]:
# Scaling age features for test dataset
scaled_age_test = pd.DataFrame(robustScaler_forAge.transform(scaled_all_num_cat_robust_test[['AgeBuilt','AgeRemodAdd']]),
                              columns=['AgeBuilt','AgeRemodAdd'])
In [189]:
# Index
scaled_age_train_X.index = scaled_all_num_cat_robust_trainX.index
scaled_age_val_X.index = scaled_all_num_cat_robust_valX.index
scaled_age_train_X
Out[189]:
AgeBuilt AgeRemodAdd
Id
619 -0.717391 -0.388889
871 0.304348 0.916667
93 1.195652 -0.305556
818 -0.586957 -0.222222
303 -0.608696 -0.277778
... ... ...
764 -0.500000 -0.111111
836 0.586957 0.027778
1217 -0.021739 0.500000
560 -0.652174 -0.333333
685 -0.456522 -0.055556

1110 rows × 2 columns

In [190]:
scaled_age_test.index = scaled_all_num_cat_robust_test.index
scaled_age_test
Out[190]:
AgeBuilt AgeRemodAdd
0 0.347826 0.972222
1 0.413043 1.055556
2 -0.434783 -0.027778
3 -0.456522 -0.055556
4 -0.326087 0.111111
... ... ...
1454 0.065217 0.611111
1455 0.065217 0.611111
1456 0.282609 0.888889
1457 -0.413043 0.000000
1458 -0.434783 -0.027778

1459 rows × 2 columns

In [191]:
# drop unscaling AgeBuilt and AgeRemodAdd

scaled_all_num_cat_robust_trainX = scaled_all_num_cat_robust_trainX.drop(['AgeBuilt','AgeRemodAdd'],axis=1) 
scaled_all_num_cat_robust_valX = scaled_all_num_cat_robust_valX.drop(['AgeBuilt','AgeRemodAdd'],axis=1) 
scaled_all_num_cat_robust_trainX
Out[191]:
LotFrontage LotArea MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF ... x25_ConLI x25_ConLw x25_New x25_WD x26_Abnorml x26_AdjLand x26_Alloca x26_Family x26_Normal x26_Partial
Id
619 0.85 0.531266 2.605187 -0.478172 0.0 2.219648 1.683557 1.397194 0.000000 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
871 -0.65 -0.758354 0.000000 -0.543656 0.0 0.709567 -0.235781 -0.410256 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
93 0.35 0.953038 0.000000 0.429059 0.0 -0.544831 -0.273009 -0.274794 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
818 0.10 0.928987 0.853026 1.118008 0.0 -0.223938 1.158221 1.128205 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
303 2.25 1.040127 0.864553 -0.543656 0.0 1.819820 1.102378 0.841800 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
764 0.45 -0.041899 3.878963 1.042974 0.0 -0.671815 0.504654 0.313498 1.484438 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
836 -0.65 0.001139 0.000000 0.059345 0.0 0.247962 0.122027 -0.075472 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1217 -0.25 -0.168481 0.000000 -0.543656 0.0 -0.824539 -2.084798 0.410256 0.790257 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
560 -1.45 -1.620127 0.103746 -0.543656 0.0 1.533248 0.756980 0.872762 0.000000 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
685 -0.75 1.816329 0.172911 -0.543656 0.0 1.226083 0.386763 0.172230 0.871448 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

1110 rows × 222 columns

In [192]:
# For test dataset
scaled_all_num_cat_robust_test = scaled_all_num_cat_robust_test.drop(['AgeBuilt','AgeRemodAdd'],axis=1) 
In [193]:
# concat the scaled features
scaled_all_num_cat_robust_trainX = pd.concat([scaled_all_num_cat_robust_trainX,scaled_age_train_X],axis=1)
scaled_all_num_cat_robust_valX = pd.concat([scaled_all_num_cat_robust_valX,scaled_age_val_X],axis=1)
scaled_all_num_cat_robust_trainX
Out[193]:
LotFrontage LotArea MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF ... x25_New x25_WD x26_Abnorml x26_AdjLand x26_Alloca x26_Family x26_Normal x26_Partial AgeBuilt AgeRemodAdd
Id
619 0.85 0.531266 2.605187 -0.478172 0.0 2.219648 1.683557 1.397194 0.000000 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -0.717391 -0.388889
871 -0.65 -0.758354 0.000000 -0.543656 0.0 0.709567 -0.235781 -0.410256 0.000000 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.304348 0.916667
93 0.35 0.953038 0.000000 0.429059 0.0 -0.544831 -0.273009 -0.274794 0.000000 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.195652 -0.305556
818 0.10 0.928987 0.853026 1.118008 0.0 -0.223938 1.158221 1.128205 0.000000 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 -0.586957 -0.222222
303 2.25 1.040127 0.864553 -0.543656 0.0 1.819820 1.102378 0.841800 0.000000 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 -0.608696 -0.277778
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
764 0.45 -0.041899 3.878963 1.042974 0.0 -0.671815 0.504654 0.313498 1.484438 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 -0.500000 -0.111111
836 -0.65 0.001139 0.000000 0.059345 0.0 0.247962 0.122027 -0.075472 0.000000 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.586957 0.027778
1217 -0.25 -0.168481 0.000000 -0.543656 0.0 -0.824539 -2.084798 0.410256 0.790257 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 -0.021739 0.500000
560 -1.45 -1.620127 0.103746 -0.543656 0.0 1.533248 0.756980 0.872762 0.000000 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 -0.652174 -0.333333
685 -0.75 1.816329 0.172911 -0.543656 0.0 1.226083 0.386763 0.172230 0.871448 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 -0.456522 -0.055556

1110 rows × 224 columns

In [194]:
# For test dataset
scaled_all_num_cat_robust_test = pd.concat([scaled_all_num_cat_robust_test,scaled_age_test],axis=1)
scaled_all_num_cat_robust_test.shape
Out[194]:
(1459, 224)
In [195]:
# Feature Engineering Using KMeans

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=6) # 6 is abritrary, we can do experiment with different cluster
# For trainX
scaled_all_num_cat_robust_trainX['Cluster'] = kmeans.fit_predict(scaled_all_num_cat_robust_trainX)
scaled_all_num_cat_robust_trainX['Cluster'] = scaled_all_num_cat_robust_trainX['Cluster'].astype('category')

# For valX
scaled_all_num_cat_robust_valX['Cluster'] = kmeans.fit_predict(scaled_all_num_cat_robust_valX)
# for test
scaled_all_num_cat_robust_test['Cluster'] = kmeans.fit_predict(scaled_all_num_cat_robust_test)
In [196]:
# Feature Selection
cat_features.append('Cluster')
# Mutual Information

# Mutual Information treat discrete feature differently with continous feature. First, we need to sort them
disc_features = scaled_all_num_cat_robust_trainX.columns.isin(cat_features).tolist()
# in our case, because of standarization, all value of has been become float64. That's why i sort them by using 
# cat_features and num_features i had created before
disc_features
Out[196]:
[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True]
In [197]:
# Feature Selection

# Mutual Information

# Mutual Information treat discrete feature differently with continous feature. First, we need to sort them
from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores
ames_mi_score = make_mi_scores(scaled_all_num_cat_robust_trainX,scaled_train_y['SalePrice'],discrete_features=disc_features)
In [198]:
# Top 10 related features
ames_mi_score.head(10)
Out[198]:
OverallQual    0.541482
GrLivArea      0.486116
YearBuilt      0.403426
TotalBsmtSF    0.402351
GarageCars     0.350771
GarageArea     0.349153
AgeBuilt       0.337192
BsmtQual       0.324694
1stFlrSF       0.318102
GarageYrBlt    0.314362
Name: MI Scores, dtype: float64
In [199]:
# Bottom 10 related features
ames_mi_score.tail(10)
Out[199]:
x11_2.5Fin    0.0
x11_1Story    0.0
x18_GasW      0.0
x10_TwnhsE    0.0
x2_IR3        0.0
x21_Min1      0.0
x9_Norm       0.0
x21_Mod       0.0
x1_Pave       0.0
Cluster       0.0
Name: MI Scores, dtype: float64
In [200]:
# To make comparison easier, here the plot bar of all MI score
plt.figure(figsize=(20,15))
sns.barplot(y=ames_mi_score.index[:50],x=ames_mi_score[:50])
plt.show()
In [201]:
# filter out feature with Mutual Information lower than threshold; abritrary = 0.3

highest_mi_score =ames_mi_score[ames_mi_score > 0.3].index.tolist()
mi_all_num_cat_robust_trainX = scaled_all_num_cat_robust_trainX[highest_mi_score]
mi_all_num_cat_robust_valX = scaled_all_num_cat_robust_valX[highest_mi_score]

mi_all_num_cat_robust_trainX
Out[201]:
OverallQual GrLivArea YearBuilt TotalBsmtSF GarageCars GarageArea AgeBuilt BsmtQual 1stFlrSF GarageYrBlt ExterQual KitchenQual
Id
619 7.0 0.532391 2007 1.683557 1.0 1.437423 -0.717391 0.0 1.397194 2007.0 0.0 2.0
871 3.0 -0.934433 1962 -0.235781 -1.0 -0.872367 0.304348 4.0 -0.410256 1962.0 3.0 3.0
93 3.0 -0.824499 1921 -0.273009 0.0 -0.257745 1.195652 2.0 -0.274794 1921.0 3.0 3.0
818 6.0 0.314095 2002 1.158221 1.0 1.848823 -0.586957 2.0 1.128205 2002.0 2.0 2.0
303 5.0 0.081665 2001 1.102378 1.0 1.779430 -0.608696 2.0 0.841800 2001.0 2.0 2.0
... ... ... ... ... ... ... ... ... ... ... ... ...
764 6.0 1.375736 1999 0.504654 1.0 1.843866 -0.500000 2.0 0.313498 1999.0 2.0 2.0
836 2.0 -0.662740 1950 0.122027 0.0 -0.237918 0.586957 2.0 -0.075472 1996.0 3.0 2.0
1217 4.0 0.648606 1978 -2.084798 0.0 0.272615 -0.021739 3.0 0.410256 1978.0 3.0 3.0
560 5.0 0.106792 2003 0.756980 0.0 -0.317224 -0.652174 2.0 0.872762 2003.0 2.0 2.0
685 5.0 0.549666 1998 0.386763 0.0 0.009913 -0.456522 2.0 0.172230 1998.0 2.0 3.0

1110 rows × 12 columns

In [202]:
# One thing i first notice is : No Location features, ie Neighborhood, include in Top 10 related features.
# I try to visualize it to see if really there is no strong corellation between Neighborhood and SalePrice

sns.boxplot(x='Neighborhood',y='SalePrice',data=ameHouse)
plt.show()
In [203]:
# Based on some research about real estate, Neighborhood proved to played significant role here. But my mutual_information 
# regression said the opposite.
# Here i decide to do a comparison, a model built using this top features based on mutual information without neighborhood 
# and the same features with additional features from all neighborhood componenet.

# but before i do that, i will continue to perform another feature selection procedure.
In [204]:
# Manually statistical test
# Statistics Significance test is slassified based on sample value 
plt.figure(figsize=(15,15))
sns.heatmap(scaled_all_num_cat_robust_trainX.corr())
plt.show()
In [205]:
# Spearman's rank Correlation with response feature
from scipy.stats import spearmanr
rhols = []
pls = []
for col in scaled_all_num_cat_robust_trainX.columns:
    rho, p = spearmanr(scaled_all_num_cat_robust_trainX[col], scaled_train_y['SalePrice'])
    rhols.append(rho)
    pls.append(p)
In [206]:
# Convert it to dataframe
sign = []
for i in pls:
    if i < 0.05:
        sign.append('Significant')
    else:
        sign.append('Not significant')

spear_corr = pd.DataFrame({
    'Feature' : scaled_all_num_cat_robust_trainX.columns,
    'Corr' : rhols,
    'p-value': pls,
    'Significance': sign
})
In [207]:
# show the number of significant feature
len(spear_corr.loc[spear_corr['Significance']=='Significant'])
Out[207]:
146
In [208]:
#145 out of 224 features proved to be statistically significant correlate to SalePrice
#first, let's drop all insignificant features
sig_features = spear_corr.loc[spear_corr['Significance']=='Significant']['Feature'].tolist()
sig_all_num_cat_robust_trainX = scaled_all_num_cat_robust_trainX[sig_features]
sig_all_num_cat_robust_valX = scaled_all_num_cat_robust_valX[sig_features]
In [209]:
# I will use these 145 features as model building component. 
# but, now i am curious, what if, instead use all of them, i just use features with correlation above certain threshold
# For this experiment, i will use absolute mean of correlation rank as threshold
shapiro(spear_corr.loc[spear_corr['Significance']=='Significant']['Corr'])[1]
threshold_corr = spear_corr.loc[spear_corr['Significance']=='Significant']['Corr'].abs().median()

# filter out the features with correlation below median correlation value
sig_high_features = spear_corr[(spear_corr['Corr'] > threshold_corr) & (spear_corr['Significance']=='Significant')]['Feature'].tolist()
sig_high_num_cat_robust_trainX = scaled_all_num_cat_robust_trainX[sig_high_features]
sig_high_num_cat_robust_valX = scaled_all_num_cat_robust_valX[sig_high_features]
In [210]:
sig_high_num_cat_robust_trainX.shape
sig_high_num_cat_robust_valX.shape
Out[210]:
(269, 43)
In [211]:
# Now, i am already have 6 datasets to train and compared: train-val data with highest MI score, train-val data with
# significant spearman correlation, and train-val data highest and significant spearman correlation

Modelling¶

In this section, i'll start to build prediction model using these following algorithms:

  • Random Forest
  • XGBoost

Here i will train and validate the model quality using cross-validation method, then choose the highest value as the best model and predictor

In [212]:
# Function to score model
from sklearn.model_selection import cross_val_score

def score_model(X,y,model):
    score= -1 * cross_val_score(model,X,y,cv=5,scoring="neg_mean_absolute_error")
    score_mean = score.mean()
    return score_mean
In [213]:
# For accuracy scoring purpose, i will concat train and val dataset
# This because cross-validation will shuffle all observation into n-parameters randomly, train-predict the data, 
# and give average score of accurancy from each prediction subsets

# import Random Forest Regressor Module
from sklearn.ensemble import RandomForestRegressor
# import XGBRegressor Module
from xgboost import XGBRegressor

# merge train and validation dataset
forest_model = RandomForestRegressor(n_estimators = 200, random_state=0)
xgboost_model = XGBRegressor(n_estimators =500, learning_rate=0.01, random_state=0)
In [214]:
#baseline dataset
concat_scaled_all_num_cat_robust = pd.concat([scaled_all_num_cat_robust_trainX,scaled_all_num_cat_robust_valX],axis=0)
concat_scaled_all_num_cat_robust.shape
Out[214]:
(1379, 225)
In [215]:
# Highest MI dataset
concat_mi_all_num_cat_robust = pd.concat([mi_all_num_cat_robust_trainX,mi_all_num_cat_robust_valX],axis=0)
concat_mi_all_num_cat_robust.shape
Out[215]:
(1379, 12)
In [216]:
# All Significant Dataset
concat_sig_all_num_cat_robust = pd.concat([sig_all_num_cat_robust_trainX,sig_all_num_cat_robust_valX],axis=0)
concat_sig_all_num_cat_robust.shape
Out[216]:
(1379, 146)
In [217]:
# High Correlated Significant Dataset
concat_sig_high_num_cat_robust = pd.concat([sig_high_num_cat_robust_trainX,sig_high_num_cat_robust_valX],axis=0)
concat_sig_high_num_cat_robust.shape
Out[217]:
(1379, 43)
In [218]:
concat_scaled_y = pd.concat([scaled_train_y,scaled_val_y],axis=0)
concat_scaled_y.shape
Out[218]:
(1379, 1)
In [219]:
print("Mean Absolute Error of baseline dataset in Random Forest Model: {} ".format(score_model(concat_scaled_all_num_cat_robust,
                                                                                      concat_scaled_y['SalePrice'],forest_model)))
Mean Absolute Error of baseline dataset in Random Forest Model: 0.2092848962723398 
In [220]:
print("Mean Absolute Error of baseline dataset in XGBoost Model: {} ".format(score_model(concat_scaled_all_num_cat_robust,
                                                                                      concat_scaled_y['SalePrice'],xgboost_model)))
Mean Absolute Error of baseline dataset in XGBoost Model: 0.1987541564657423 
In [221]:
print("Mean Absolute Error of MI_highest dataset in Random Forest Model: {} ".format(score_model(concat_mi_all_num_cat_robust,
                                                                                      concat_scaled_y['SalePrice'],forest_model)))
Mean Absolute Error of MI_highest dataset in Random Forest Model: 0.23560130661637196 
In [222]:
print("Mean Absolute Error of MI_highest dataset in XGBoost Model: {} ".format(score_model(concat_mi_all_num_cat_robust,
                                                                                      concat_scaled_y['SalePrice'],xgboost_model)))
Mean Absolute Error of MI_highest dataset in XGBoost Model: 0.2330302579651387 
In [223]:
print("Mean Absolute Error of all Significant dataset in Random Forest Model: {} ".format(score_model(concat_sig_all_num_cat_robust,
                                                                                      concat_scaled_y['SalePrice'],forest_model)))
Mean Absolute Error of all Significant dataset in Random Forest Model: 0.20750087976532589 
In [224]:
print("Mean Absolute Error of all Significant dataset in XGBoost Model: {} ".format(score_model(concat_sig_all_num_cat_robust,
                                                                                      concat_scaled_y['SalePrice'],xgboost_model)))
Mean Absolute Error of all Significant dataset in XGBoost Model: 0.19794070810944578 
In [225]:
print("Mean Absolute Error of High Significant dataset in Random Forest Model: {} ".format(score_model(concat_sig_high_num_cat_robust,
                                                                                      concat_scaled_y['SalePrice'],forest_model)))
Mean Absolute Error of High Significant dataset in Random Forest Model: 0.2123408166365283 
In [226]:
print("Mean Absolute Error of High Significant dataset in XGBoost Model: {} ".format(score_model(concat_sig_high_num_cat_robust,
                                                                                      concat_scaled_y['SalePrice'],xgboost_model)))
Mean Absolute Error of High Significant dataset in XGBoost Model: 0.21109059713462117 
In [227]:
# Create test dataset with all significant features only
sig_scaled_all_num_cat_robust_test = scaled_all_num_cat_robust_test[sig_features]
In [228]:
sig_all_num_cat_robust_trainX['Cluster']=sig_all_num_cat_robust_trainX['Cluster'].astype(np.int64)
In [229]:
sig_scaled_all_num_cat_robust_test[['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']] = sig_scaled_all_num_cat_robust_test[['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']].astype(np.int64)
sig_scaled_all_num_cat_robust_test[['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']].dtypes
Out[229]:
YearBuilt       int64
YearRemodAdd    int64
GarageYrBlt     int64
dtype: object
In [230]:
# After assessment, turns out the best model for this case is using all significant dataset, trained by XGBoostRegression model.
# So, i will use this to predict test data
xgboost_model.fit(sig_all_num_cat_robust_trainX,scaled_train_y)
salePrice_predict = xgboost_model.predict(sig_scaled_all_num_cat_robust_test).reshape(-1,1)
In [231]:
salePrice_predict
inversed_salePrice_predict = robustScaler_forPrice.inverse_transform(salePrice_predict)
inversed_salePrice_predict = inversed_salePrice_predict.tolist()
In [232]:
flat_salePrice_predict = sum(inversed_salePrice_predict,[])
#flat_salePrice_predict
In [233]:
#sub_salePrice_prediction = pd.DataFrame({'Id':ameHouse_test['Id'], 'SalePrice':flat_salePrice_predict})
#sub_salePrice_prediction
#sub_salePrice_prediction.to_csv("submission_ames_iowa_house_price_prediction.csv", index=False)